Byte Latent Transformer: Patches Scale Better Than Tokens (meta.com)
378 points by zxexz 48 days ago | 84 comments




The summer that BERT came out I was working at a startup that was using character-based CNN models for classification. We were thinking a lot about alternate representations; other members of the team were keen on word vectors, but I wasn't, particularly because the documents we were working on frequently had out-of-dictionary words, because those words were important, and because discarding them would lead to failure.

(We were working on "foundation models" too, so it's not just being out-of-dictionary in the final model that's a problem but being out-of-dictionary in the foundation model which is more expensive to train.)

We were doing OK with character-based models for classification, but people believed that storing the "dictionary" inside the neural net was not a good use of the neural net, so there was a lot of enthusiasm for tokens.

Meanwhile I felt so sure that schemes like Word2Vec were doomed that I had left an earlier project using RNNs where the goal was text understanding with a foundation model made by training an RNN to write fake abstracts for case reports from PubMed.

When byte-pair encoding was introduced I remember telling people in a meeting that it was the first tokenization scheme we'd looked at that I could endorse.

I have to admit though that I wish we could work at the character level.


I was really excited for CANINE [1] but it never really went anywhere. Tokens are a hack. They work for the most part, but it’s clear when they don’t.

[1] https://arxiv.org/abs/2103.06874


Do you mean that all produced output must be a chain of words found in a dictionary?

The real world has humans creating and using non-dictionary words to communicate daily. A good example is "notify", which is defined in the dictionary, versus "notifier", which is not, and is used to describe "a means to notify someone". The code to send an email notification is an "email notifier", and then there are text message, voice call, and call-center callback notifiers ....

All industries and organizations have jargon, custom-defined words not found in a dictionary, and non-distinctive acronyms.

How would ML output be useful if it cannot handle real-world communication and only produces lab-sanitized, in-dictionary-only responses?


(Author here)

If I understand your question right, this is one of the reasons BPE is nice and why the parent liked it. For any character sequence, provided the characters are in the alphabet used to create the BPE vocab, there are no unknown words/sequences. One downside of some previous tokenization methods, e.g. dictionary-based methods, is that you could end up with unknown/UNK tokens.

In our paper with bytes, we also avoid the UNK issue, since we can have an embedding for every possible byte (there aren't that many), and for sequences of bytes we use hash embeddings (although we did test n-gram lookups for the top K most frequent byte n-grams in the training data).
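Roughly, the hash-embedding idea for byte n-grams looks like this (a minimal sketch; the hash function, table size, n-gram sizes, and pooling here are illustrative placeholders, not the paper's actual settings):

    import torch
    import torch.nn as nn

    class HashNgramEmbedding(nn.Module):
        # Map byte n-grams to rows of a fixed-size table via hashing,
        # so no n-gram is ever "unknown"; collisions are simply tolerated.
        def __init__(self, num_buckets=100_000, dim=256, ngram_sizes=(3, 4, 5)):
            super().__init__()
            self.table = nn.Embedding(num_buckets, dim)
            self.num_buckets = num_buckets
            self.ngram_sizes = ngram_sizes

        def forward(self, byte_seq):  # byte_seq: list of ints in 0..255
            embs = []
            for n in self.ngram_sizes:
                for i in range(len(byte_seq) - n + 1):
                    bucket = hash(tuple(byte_seq[i:i + n])) % self.num_buckets
                    embs.append(self.table(torch.tensor(bucket)))
            # mean-pool as a stand-in for however the model actually combines them
            return torch.stack(embs).mean(dim=0)

    emb = HashNgramEmbedding()(list(b"lazy dog"))  # -> 256-dim vector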


Nice work. Thank you for commenting on HN!

Did you guys try using an RNN or some other kind of DNN to encode the patches?


I don't believe so, or at least if someone tried, it didn't work well enough that I remember :). Some of the motivation for the architecture changes in encoding patches stemmed from finding FLOP-efficient ways to express relationships between byte sequences. E.g., having a long context window makes sense when dealing with tokens, but you don't need as long an attention window if you're attending over byte sequences to make patch representations, since the patch representations will implicitly be part of a longer context window in terms of number of patches.
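To make the intuition concrete (this is not the paper's exact local-attention setup, and the window size is an arbitrary choice), a windowed causal mask over bytes might look like:

    import torch

    def local_attention_mask(seq_len, window):
        # Each byte position may attend only to itself and the previous
        # `window - 1` bytes, instead of the full sequence.
        idx = torch.arange(seq_len)
        causal = idx[None, :] <= idx[:, None]
        nearby = (idx[:, None] - idx[None, :]) < window
        return causal & nearby  # True where attention is allowed

    mask = local_attention_mask(seq_len=16, window=4)
    print(mask.int())  # rows = query bytes, columns = key bytes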


Thanks for the quick reply!

Interesting. I would have thought one of those "minimum viable" RNNs (like https://arxiv.org/abs/2410.01201) would have been ideal for this. I might tinker a bit with this :-)


That's the OP's point. At the time, the community was split between word-level, which has the shortcomings you're describing, and byte-level, which is uselessly compute intensive. BPE was the first reasonable in-between. BLT improves on BPE by making the compression learnable rather than precomputed.


I really hope this works out. Death to tokenizers!

Interesting that it's a hierarchical structure but only two levels of hierarchy. Stacking more levels seems like an obvious direction for further research.

Note: I posted this comment on another related story[1] and the author replied:

"Author here :), I do think it’s a good direction to look into! That said, aside from it being a bit too much to do at once, you’d also have to be careful about how you distributed your FLOP budget across the hierarchy. With two levels, you can make one level (bytes/local encoder) FLOP efficient and the other (patches/global encoder) FLOP intensive. You’d also need to find a way to group patches into larger units. But ya, there are many directions to go from here!"

[1] https://news.ycombinator.com/item?id=42413430


Agree more levels seems like it could be beneficial. And another Meta paper published a day later shows how that might work: https://ai.meta.com/research/publications/large-concept-mode...


To create a patch, a small model is used to predict the likelihood for the next character in the input string. Input string: 'Lazy dog jumped over a fence.' Use the model to predict the likelihood of each character.

For example:

    100% sure the next character is 'a'.
    Or maybe it's 10% sure it's 'a', 10% sure it's 'b', and so on.
Then we chunk character estimates together. How many characters? Enough characters so that the total uncertainty (entropy) in each chunk is about the same. And there you have your 'patch' (or 'token').


> How many characters? Enough characters so that the total uncertainty (entropy) in each chunk is about the same.

That's not how it's described in Section 2.3 of the paper. They only use the entropy of the next byte and whether it exceeds a threshold (Global Constraint) or is larger than the preceding byte's entropy by another threshold (Approx. Monotonic Constraint).

That does mean that long repetitive sequences can result in pathologically long patches, as demonstrated in Appendix E.
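For concreteness, a sketch of those two boundary rules as described in Section 2.3 (the thresholds here are made up, and the entropies would come from the small byte-level language model):

    def patch_starts(entropies, global_thresh=2.0, mono_thresh=0.5):
        # Start a new patch at byte i if its next-byte entropy exceeds a global
        # threshold, or exceeds the previous byte's entropy by a margin
        # (the approximate monotonic constraint).
        starts = [0]
        for i in range(1, len(entropies)):
            if entropies[i] > global_thresh or entropies[i] - entropies[i - 1] > mono_thresh:
                starts.append(i)
        return starts

    H = [3.1, 0.4, 0.2, 0.3, 2.8, 0.9, 0.1, 0.1]  # made-up per-byte entropies
    print(patch_starts(H))  # indices where new patches begin -> [0, 4]
    # note: a long run of low-entropy bytes never triggers either rule,
    # hence the pathologically long patches mentioned above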

But what I'm really curious about is the "small CNN byte-level model with 2-byte context" in Figure 3 (f), because it's never mentioned in any other part of the paper.


(Author Here)

Good description! Maybe what the parent got mixed up on is that an alternate way to view this is as trying to chunk bytes so each chunk has roughly similar information. We initially tried a bunch of patching schemes, e.g., keeping a running total of entropy until the total exceeds a threshold, but ended up finding simple things worked better.

I’ll see if we can add more information about the small CNN in a next update to arXiv paper.


I'm curious if you're aware of some papers from around 2005 on using contextual entropy to do unsupervised word segmentation on Chinese, and other languages that don't use spaces for word boundaries.

https://aclanthology.org/Y03-1017/
https://aclanthology.org/I05-1009/
https://aclanthology.org/P06-2056/

Exactly the same approach of segmenting a word when the entropy goes up compared to the previous byte.


It is also quite similar to Carl de Marcken's work for segmenting text and speech. He phrased everything in terms of minimum description length (MDL), but that is trivially the same thing as local entropy.

https://dspace.mit.edu/handle/1721.1/7191?show=full


At least I wasn't aware of this work, but thanks for the refs! I'm always curious to read papers from 10-20+ years ago that have similarly inspired ideas. If it makes sense, we'll mention those in the next related work update.


One way of thinking about the "Approximate Monotonic Constraint" is that you're running a quick and dirty edge detector on the entropy. I.e., you're clipping based on the gradient of per-byte entropy wrt timestep, compared to detecting an edge based on the gradient of per-pixel intensity wrt pixel coordinates. It would be interesting to look at the raw sequences of per-byte entropies to see how strongly these sorts of "edges" correlate with human-interpretable boundaries (words, prefixes, suffixes, etc).


Figure 4 plots the entropy of each byte in "Daenerys Targaryen is in Game of Thrones, a fantasy epic by George R.R. Martin."


"That's not how it's described" - Thanks for the correction!


So a variant might be to try using some standard compression algorithm to train with?


Recent and related:

Sharing new research, models, and datasets from Meta FAIR - https://news.ycombinator.com/item?id=42412360 - Dec 2024 (61 comments)


So the only thing teaching the model (the loss) is probability prediction in single-byte space? And that is enough? Looks very promising, if I am not misunderstanding.


From my understanding this not only removes tokenization but also sampling, correct?

Sampling can be a pain point of LLMs, but it also enables interesting usages, like forcing a grammar so the model always outputs valid JSON, tuning temperature to get a more varied distribution, XTC sampling, etc.

What would be the equivalent of these in a BLT?

I can only think of providing the decoder an extra input of allowed/prohibited bytes and running the decoder over and over until it outputs something valid; maybe there's a simpler and more obvious approach.


It doesn't remove sampling, and forcing grammar by specifying allowed/prohibited bytes doesn't require running the decoder over and over, you just compute the softmax at the output layer over allowed bytes only and sample from those accordingly, same as with BPE-based models.
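A minimal sketch of that kind of constrained byte sampling (the allowed set and shapes are just for illustration):

    import torch

    def sample_constrained(logits, allowed_bytes):
        # Mask out disallowed bytes before the softmax, then sample.
        # logits: tensor of shape (256,), one score per possible next byte.
        mask = torch.full_like(logits, float("-inf"))
        mask[list(allowed_bytes)] = 0.0
        probs = torch.softmax(logits + mask, dim=-1)  # disallowed bytes get probability 0
        return torch.multinomial(probs, num_samples=1).item()

    logits = torch.randn(256)
    allowed = set(b"0123456789,[]")  # e.g. only bytes valid for a JSON array of digits
    next_byte = sample_constrained(logits, allowed)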


Does this mean AI can pre-train on binaries?


Some believe AI can now output compiled binaries (e.g. update Notepad.exe with this feature).

We all think AI writing code for us will be the end, but it might be an even simpler takeover.


That just sounds worse though? We can't validate the change is correct if we can't read the code. It is interesting though


Idk what they mean; I've never seen anyone claim, or come close to claiming, it could alter an executable binary by itself. I chose to interpret it as "some people think an LLM can code well enough to add features on top of a ~50KLOC codebase automatically".


I think he's saying people believe an LLM trained with this architecture would be able to do something like that.


at some point you can't or won't be allowed to do any validations


I find it interesting how far linguistic and experience-based approaches have fallen out of fashion. Humans don't read character by character; even if we can, it's not a standard operating mode. We have word stems and understand modifications by endings. Tokenization doesn't replicate this experience (seriously, look at the tokens that appear in LLM vocabularies), nor does character or byte encoding. Humans have multiple ways to parse words. You can grok a full sentence, read a phrase, read word by word, or sound out a new word character by character. Very few papers explicitly claim that a method is good because it replicates the way a human would perform a task, or perceive the world.

I suspect as LLM reliance increases we'll want to align the models to our experience more closely. I further suspect this will make the errors that models make more comprehensible.


> Unlike tokenization, BLT has no fixed vocabulary for patches.

iiuc this means: the vocabulary of patches is not known prior to training.

I guess once training has established a vocabulary of patches, that same fixed vocabulary is used for inference (if this is not true I don't see how it could work).

Right?


An interesting read on alternative tokenization methods.

Questions:

1. What's the goal of entropy based byte token grouping as tokenization? Is this tokenization method best suited for the goal?

2. What about simply using byte level sequence to sequence autoencoder with down sampling for tokenization?


This is neat work, but I also love the (presumably intentional?) backronym of BLT.



Interesting, this is one of the worst NotebookLM examples I've seen so far. They are interjecting way too often and breaking the rhythm. Is generation quality going down due to the popularity of the service?


Big successful launch, hype for the product lead, product lead moves on, product goes to shit. Another classic for the Google graveyard.


We are working directly with the Notebook team from the outside, and while they have lost the original product lead, the team in general is seemingly really well supported, staffed with talented folks, and actively trying to understand what the end user wants from the product. Hardly a day goes by that they are not actively trying to get more feedback and share where they are heading.

I do think it is fair to say they had been caught off guard by the success of the program and are trying to catch up. Maybe this is just a bit of drift as they are figuring it all out? Or maybe I am too charitable.


> and while they have lost the original product lead

Doesn't matter how talented the team is, that is a massive red flag as not even six months have passed since the launch.


Core team just moved on to something else: https://werebuilding.ai/


That has to be the worst landing page ever


no comment; not my site, just sharing


Uh, the product manager left last week.


Yeah, super strange. One cannot finish a sentence without the other interjecting.


People like this?


Why can't the tokenization be implicit, so we only feed bytes (or characters) to the model?


(Author Here)

Not sure what you mean by implicit? If you mean just treating bytes as tokens, one issue you run into is that your sequence lengths get quite long, so compared to a regular token LLM you can’t pack as many bytes in a batch, which means you’re pretty FLOP inefficient and so scale worse. You could make the model smaller to compensate, but then the model isn’t as good.
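Back-of-the-envelope illustration of the sequence-length problem (the 4-bytes-per-token figure is a rough assumption for English BPE, not a measured number):

    doc_bytes = 8192                                 # bytes in a document
    bytes_per_bpe_token = 4                          # rough typical compression (assumed)

    byte_seq_len = doc_bytes                         # 8192 positions if bytes are tokens
    bpe_seq_len = doc_bytes // bytes_per_bpe_token   # 2048 positions with BPE

    # the attention term alone scales roughly with n^2
    print((byte_seq_len ** 2) / (bpe_seq_len ** 2))  # 16x more attention compute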


It can work, but you have more tokens / weaker performance.

People tested it and it was worse.


I wonder whether llama 4 will use this


Related quote from Karpathy:

Tokenization is at the heart of much weirdness of LLMs. Do not brush it off.

• Why can't LLM spell words? Tokenization.

• Why can't LLM do super simple string processing tasks like reversing a string? Tokenization.

• Why is LLM worse at non-English languages (e.g. Japanese)? Tokenization.

• Why is LLM bad at simple arithmetic? Tokenization.

• Why did GPT-2 have more than necessary trouble coding in Python? Tokenization.

• Why did my LLM abruptly halt when it sees the string "<|endoftext|>"? Tokenization.

• What is this weird warning I get about a "trailing whitespace"? Tokenization.

• Why the LLM break if I ask it about "SolidGoldMagikarp"? Tokenization.

• Why should I prefer to use YAML over JSON with LLMs? Tokenization.

• Why is LLM not actually end-to-end language modeling? Tokenization.

• What is the real root of suffering? Tokenization.


It’s weird because I’m pretty sure my brain does something similar when I speed read. I don’t actually, usually, read the words; instead I recognize the shape of the words (most common words) then I jump to the subject of the paragraphs and break down the meaning of the whole page in a second or so.


(Author Here)

In editing we couldn’t find a good place for this so we cut it from the current version, but at one point we had discussed a parallel with the information density of speech as described by one paper. Essentially the paper found that in languages that were less information dense per syllable, speakers spoke faster to achieve similar information density to languages with higher density per syllable. You could see patching by entropy paralleling this if you consider that low-entropy bytes, in terms of Shannon entropy, are less information dense.


That's generally true, but you also have the ability to stop and look closer if you want to. If someone asks you to count the letters in a word, you will stop to look at the letters individually. If you see an unfamiliar word like SolidGoldMagikarp, you can stop and break it apart. Tokenization prevents LLMs from doing this.


Generally the current crop of LLMs seem like pretty good analogues of the "scan reading" immediate instinctual response to stimulus, but they seem to completely lack the higher level that can then go "Wait, that doesn't seem right, let's go back over that again". Like hallucinations, and seeing "faces" in dark shadows until you look again, it's like it's doing a pretty good emulation of some level of consciousness.

Is that a fundamental difference in the level of processing? I haven't seen that sort of second-tier logic pop up as emergent behavior from increasing scale yet, but will that come with time? I'm not sure.


You can prompt the model to do that kind of "stream of mind" process. It will maximize modeling uncertainty. This is my prompt:

> Write in a raw, real-time stream-of-consciousness style, as if actively solving a problem. Your response should feel like unpolished notes—messy, exploratory, and authentic. Show your full thought process, including missteps, dead ends, and course corrections. Use markers to signal mental states: Insights: "Wait -", "Hold on -", "Oh -", "Suddenly seeing -", "This connects to -". Testing: "Testing with -", "Breaking this down -", "Running an example -", "Checking if -". Problems: "Stuck on -", "This doesn’t work because -", "Need to figure out -", "Not quite adding up -". Progress: "Making headway -", "Starting to see the pattern -", "Explains why -", "Now it makes sense -". Process: "Tracing the logic -", "Following this thread -", "Unpacking this idea -", "Exploring implications -". Uncertainty: "Maybe -", "Could be -", "Not sure yet -", "Might explain -". Transitions: "This leads to -", "Which means -", "Building on that -", "Connecting back to -". Lean into real-time realizations: "Wait, that won't work because…" or "Ah, I missed this…" Show evolving understanding through short paragraphs, with natural pauses where ideas shift. Structure your thought evolution as follows: Begin with an initial take: "This might work because…" or "At first glance…" Identify problems or angles: "Actually, this doesn’t hold up because…" Test examples or counterexamples: "Let me try -", "What happens if -". Seek deeper patterns: "I’m seeing a connection -", "This ties back to -". Link broader implications: "This means -", "If this holds, then -". Admit confusion openly: "I don’t get this yet", "Something’s missing here". Reveal partial understanding: "I see why X, but not Y". Show failures and iterations: "Still not right - trying another approach". Embrace a debugging mindset, treating ideas like code—break them into steps, test logic, reveal failure modes, and iterate. Skip introductions and conclusions. Stop when you solve the problem or find clear next steps. Use short, direct sentences to mimic real-time thinking. The goal is to capture the messy, evolving nature of problem-solving and thought refinement.

Just try this, you can insert at any point in a LLM chat session. I built it by reverse engineering the QwQ-32B model responses with Claude. QwQ itself is based on the GPT-o1 method.


FWIW this gave more entertaining but ultimately worse results than without on Claude for me, using the prompt:

> How many chickens can fit on a 747?


I've tried prompts like this with Claude, but it can get so nitpicky of itself that it runs out of space for the actual answer. It seems it does help to train the model to do it.


I've often wanted to talk with an LLM about its tokenization (e.g. how many tokens are there in "the simplest of phrases"). I wonder, if you fed it information about its tokenization (text like "rabbit is spelled r, a, b, b, i, t"), whether it could talk about it.


Well said!!

I’m waiting for reading studies on AI generated text, that’s a different kind of speed read


Meta's approach doesn't seem to throw out character grouping entirely, it just makes it dynamic.


Goodbye tokenization problems, hello encoding problems!


!Long post warning!

Tokenization is often scapegoated for many transformer limitations. I suppose it's because reading about the many limitations of the transformer architecture is harder than dumping everything on tokenization (which to be fair, is often indirectly involved with or exacerbating some deeper issue).

> Why can't LLM spell words? Tokenization.

LLMs can spell if you ask them to though. And there have been investigations into this capability (ref:2). Tokenization makes computations that involve spelling more difficult, but this is downstream of deeper computational limitations of the architecture.

> Why can't LLM do super simple string processing tasks like reversing a string?

Ditto.

> Why is LLM worse at non-English languages (e.g. Japanese)? Tokenization.

Tokenization is also implicitly performing compression. If your tokenizer's corpus is focused only on English, basic information theory explains why it'll be less efficient for other languages. The net effect is longer sequences where tokens are less information dense for non-English languages on average.

> Why is LLM bad at simple arithmetic? Tokenization.

Tokenization could treat digits separately, and I believe llama2 did this. But OpenAI built tiktoken, which does not do this, and llama3 uses tiktoken.
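You can check how a given encoding splits digits directly; a quick sketch assuming the tiktoken package is installed (the exact splits depend on the encoding):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("12345678 + 87654321")
    print([enc.decode([i]) for i in ids])  # shows which character spans become single tokens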

The transformer architecture also has limitations that make (default) arithmetic computations involving carries difficult to learn. You can read more about this in (ref:1).

> Why did my LLM abruptly halt when it sees the string "<|endoftext|>"? Tokenization.

Why should it not? Either way, it doesn't have to halt, as the sampler can just ignore this. But the distribution will still condition on this as a change of topic. The question should probably be: why did the LLM suddenly assign high probability to a stop token before finishing whatever it was writing?

> What is this weird warning I get about a "trailing whitespace"? Tokenization.

Modeling decisions for how to treat whitespace is upstream of tokenization. These choices affect how the LLM models word boundaries. Things can be fine most of the time until they aren't.

There's also the issue of softmax. The way softmax is typically applied forces the model to always assign importance to some tokens, even when no strong relationships exist between them. This in turn leads to the model disproportionately dumping its focus on often semantically unimportant tokens like whitespace or punctuation. Misallocating attention in this manner can lead to wasting representational capacity due to overemphasizing unimportant tokens, perhaps inducing spurious correlations on whitespace. This issue propagates through the model, possibly leading to unexpected negative downstream effects.
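A tiny numpy illustration of that forcing effect (values are arbitrary):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    # Even when every attention score is low and nearly equal (no strong
    # relationship anywhere), softmax still hands out a full unit of attention mass:
    scores = np.array([-8.0, -8.1, -7.9, -8.05])
    w = softmax(scores)
    print(w, w.sum())  # weights still sum to 1.0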

> Why the LLM break if I ask it about "SolidGoldMagikarp"? Tokenization.

One step down, it's really a result of high dimensional random vectors.

> Why should I prefer to use YAML over JSON with LLMs? Tokenization.

> Why did GPT-2 have more than necessary trouble coding in Python? Tokenization.

Tokenization does make counting more difficult but the net benefit to programming languages where whitespace can be semantically meaningful is a strong positive. Even when whitespace is not meaningful, long strings of them can often be encountered. Not being careful about devoting tokenization effort on whitespace will significantly degrade code modeling ability in LLMs.

> Why is LLM not actually end-to-end language modeling? Tokenization.

This is correct, but it is not necessarily the case that a character or byte based model will automatically be better. The issue is that LLMs as currently devised spend the same amount of computation per token. This creates the immediate problem of making meaningful sequences, which will now be substantially longer, substantially more expensive to compute, generate and store in memory. This is what the posted paper seeks to address over naive byte level modeling. Although it's unclear from the provided tables if what's claimed is actually what's occurring.

Character-level modeling will also make learning long-range dependencies harder. Subword tokenization also aids memorization, which can be useful in learning from the tail of the distribution. The following idea is based on (ref:5).

Next-token prediction can be modeled as a hierarchical sampling process where problem instances (topics, natural language tasks), which are mixture distributions, are drawn from a metadistribution, and then data points (eg various strings) are sampled from specific subpopulations (ie clusters of task types) within those instances. Here, memorization is a key strategy since there's initial uncertainty about which features are relevant for predicting the next token. Particularly for rare examples, memorizing their details acts as a starting point for associating particular patterns with specific subpopulations, in turn allowing more accurate prediction of new points.

From that starting point, the model can eventually refine its associations as it encounters more data. This is key, for example, when sampling from the tail of the distribution, where data about subpopulations will be more limited. Making memorization and learning longer dependencies more challenging can lead to final models that face more difficulty during ICL inference, which depends, among other things, on the ability to infer which task from the mixture distribution is being performed.

> What is the real root of suffering? Tokenization.

A better candidate is over-generalization.

1: https://arxiv.org/abs/2310.16028

2: What do tokens know about their characters and how do they know it? (https://aclanthology.org/2022.naacl-main.179.pdf)

3: https://arxiv.org/abs/2406.10851

4: Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP (https://arxiv.org/abs/2112.10508)

5: https://arxiv.org/abs/2012.06421


In all seriousness: why has it been years now and it feels like there is no incremental engineering-level progress on these issues? Like, it seems like doing some manual intervention on the tokenization to at least remove exceptional tokens and add some semantics to how numbers are broken up would be a quick win.


(Author Here)

There is at least some work on character-based modeling, but it hasn’t scaled well before. The challenge, I think, with something more ad hoc for exceptional tokens is that it’s hard to see gains, since they are by definition infrequent. If the text is rare enough, BPE should produce many single-byte tokens, so current models actually expend more compute on these rare sequences.

BLT scales well because it expends less compute (by patching) on more predictable (low entropy) byte sequences. Current models only to some degree get this benefit, if it’s a larger BPE token, but that only goes so far.

So it’s really two related, but different motivations.


>In all seriousness: why has it been years now and it feels like there is no incremental engineering-level progress on these issues?

From where I'm standing, LLMs appear to be the fastest moving technological field in history.


A field can seem to be going quickly and going nowhere at the same time. Or rather a new technique can be invented and then exhausted in the time it takes somebody to get a PhD. (See https://en.wikipedia.org/wiki/Renormalization_group applied to phase transitions, which turned up just in time for the physics job crisis of 1970)

I didn't ever believe that there was going to be a GPT-5 trained with exponentially more text and resources. Not only is there not enough text, but that's the path to ruin. Why?

Cycle time. Two years ago we had little idea of how those models work, so I knew there was huge room for improving performance. It gets the cost down, it lets you put the models on your device, and it speeds up development. If I can train 10 models in the time it takes you to train 1 model I can make much faster progress.

However, even a GPT-15 trained with a Dyson sphere is going to struggle to sort things. (Structurally a pure LLM can't do that!) My #1 beef with Microsoft's Copilot is that if you ask it whether it can sort a certain list of items (either a list you are discussing with it or, say, "states of the United States ordered by percent water area") it will say yes, and if you ask it what it thinks the probability is that it will get it in the right order it will say "very high", but when you try it the list comes out totally wrong.

It is equally unable to "help me make an atom bomb", except that in the bomb case it will say that it can't, but in the sorting case it says it can.

The obvious answer is that it should use tools to sort. That's right but the problem of "knowing what you can really do with your tools" is philosophically challenged. (With problems so intractable it leads people like Roger Penrose to conclude "I couldn't do math if I wasn't a thetan")


I'm not really sure I understand your sorting example, maybe try it out in gpt and post the link to show exactly what you mean.

The refusal of the model is something trained into the model by the process of rlhf, and it can also be untrained, by the process of abliteration [1].

Also, LLMs are capable of using tools in this very moment [2].

[1]: https://huggingface.co/blog/mlabonne/abliteration
[2]: https://www.anthropic.com/news/analysis-tool


I'm deliberately blurring refusal with having an accurate picture of its own abilities and, past that, having an accurate picture of what it can do given tools. Both are tested by

   "Can you X?"
With refusal you find just how shallow it is, because it really will answer all sorts of questions that are "helpful" in making a nuclear bomb, but when you ask it directly it shuts up. In another sense nothing it does is "helpful", because it's not going to hunt down some people in central Asia who have 50kg of U235 burning a hole in their pocket for you, which is what would actually "help".

I use tool using LLMs frequently, but I find they frequently need help using their tools, it is a lot of fun to talk to Windsurf about the struggles it has with its tools and it feels strangely satisfying to help it out.


You totally ignored "on these issues" and are essentially saying there is no need to work on that as they worked on something else, which is extremely strange for a thing which feels like a really trivial win, and should be shocking.

Whether you like it or not, it is entirely fair to look at an entire ecosystem and ask why some trivial thing that everyone talks about all the time hasn't seen any attention even if the entire ecosystem is getting widespread advancement.

Like, I think it would also be fair to complain about how bad the hinge on AirPods is, causing the case to pop open when dropped and your earbuds to fly everywhere (potentially getting very dirty), as well as wearing out and causing spurious activation (leading to audio routing issues and rapid battery drain).

To then point out that this is one of the most successful consumer devices in recent years and was a remarkable improvement to what came before as well as a continuing achievement of engineering as they do in fact get better in amazing ways every couple years is more than just a non sequitur: it is frankly just annoying.


My notes:

It's a 3-component model (a rough schematic sketch follows these notes).

- Encoder: Takes byte groupings and outputs a hidden state/encoding called patches

- Transformer: Takes these encodings of patches in autoregressive fashion

- Decoder: Takes the patch encodings processed by the transformer and outputs bytes

Loss is byte-level cross-entropy (next-byte prediction)

How they group bytes.

- Use entropy thresholds: If a sequence of bytes have entropy lower than a threshold, group them

- This is a learned model (from data)

Why this helps over current byte-pair tokenization in LLMs.

- Encoder/decoder essentially act as “learnable” tokenization scheme

- Better efficiency tradeoffs (as for highly predictable sequence of bytes, encoder can “offload” computation effort from the main transformer)

- History teaches us that end to end learned system beats human designed mechanisms
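
A very rough schematic of that three-part structure (layer counts, pooling, and shapes are illustrative guesses; causal masks and the paper's cross-attention pooling are omitted):

    import torch
    import torch.nn as nn

    class ByteLatentSketch(nn.Module):
        # Schematic only: local encoder -> latent (patch-level) transformer -> local decoder.
        def __init__(self, dim=512):
            super().__init__()
            layer = lambda: nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            self.byte_emb = nn.Embedding(256, dim)
            self.local_encoder = nn.TransformerEncoder(layer(), num_layers=2)        # cheap, byte level
            self.latent_transformer = nn.TransformerEncoder(layer(), num_layers=12)  # heavy, patch level
            self.local_decoder = nn.TransformerEncoder(layer(), num_layers=2)        # cheap, byte level
            self.next_byte = nn.Linear(dim, 256)

        def forward(self, byte_ids, patch_bounds):
            h = self.local_encoder(self.byte_emb(byte_ids))  # per-byte states
            # pool bytes into patch representations (mean-pool as a stand-in)
            patches = torch.stack([h[:, s:e].mean(dim=1) for s, e in patch_bounds], dim=1)
            patches = self.latent_transformer(patches)
            # broadcast each patch state back to its bytes, then predict next bytes
            spread = torch.cat([patches[:, i:i + 1].expand(-1, e - s, -1)
                                for i, (s, e) in enumerate(patch_bounds)], dim=1)
            return self.next_byte(self.local_decoder(h + spread))  # logits over 256 byte values

    model = ByteLatentSketch()
    byte_ids = torch.randint(0, 256, (1, 8))
    logits = model(byte_ids, patch_bounds=[(0, 3), (3, 8)])  # two patches covering all 8 bytes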


> History teaches us that end to end learned system beats human designed mechanisms

I think this may need some qualifiers

Even byte representations are human-designed encodings. I would think a human-designed decoder of such encodings must be more efficient than learning one. Sure, bytes encoding a stream of Unicode code points map fairly easily to useful information. But bytes representing a zip-compressed collection of PDF files?

I did wonder, though, about training on text encodings vs pixel encodings; perhaps brute-forcing OCR, like humans do, will be more flexible in the end than being limited to text encodings.


>Even byte representations are human designed encodings

The point is that it can model any sequence of bytes. It's what-follows-what that matters, not how we're encoding it.


> History teaches us that end to end learned system beats human designed mechanisms

Depends how far back you go. History teaches us that everything is a trade-off between model size, inference time, training time, and training data size, once you're at the Pareto frontier. And that cheap approximations can allow you to trade for more expensive computation elsewhere.

That lesson has been obscured for the last decade because (1) "the bitter lesson" of scaling, and, (2), we're blowing past benchmarks too quickly.

I do agree that learned models are better if they're free (compare the distribution of filter banks learned by a neural acoustic model to those approximated by mel frequency cepstral coefficients), but once you start hitting scaling limits, cheap heuristics start creeping back in.

BPE was a huge advancement over fixed vocab, e.g.


(Author Here)

Related thought: I think BPE is quite a good, cheap inductive bias to have in a model, which is part of what made it challenging to scale better against. I also suspect this is part of why BPE is better with less training FLOPs (left side of figure 1): BLT has to expend some of its FLOPs budget to recover/learn some of this useful bias. With more training FLOPs this becomes a smaller fraction of the budget, though, leading to better scaling.


I thought we’re supposed to be plateauing!?


We are. Plateauing doesn't mean you don't make progress. Arguably that is what you would call "plateaued".

The argument of plateauing is not that AI is fundamentally impossible. The argument is that just dumping more data and more compute on the problem, using the same approach, has diminishing returns.

It's that statistical inference is not how the human mind works (not exclusively) and thus that we are not guaranteed to be able to replicate all traits of human intelligence by brute forcing.

Of course we can and will still improve the algorithms. But the question remains whether tweaks like these, as cool and useful they may be to solve certain issues, will be enough by themselves.

Since it remains statistical in nature, my position is "no".


> that we are not guaranteed to be able to replicate all traits of human intelligence by brute forcing.

We know from complexity theory that transformers with chain of thought are guaranteed to be able to reproduce a significant fraction of human reasoning, anything in the complexity class PTIME: https://arxiv.org/abs/2310.07923


I don’t think this paper says what you claim. It says chain of reasoning and its length can improve transformer performance, not that this represents a significant fraction of human reasoning or that it’s even reasoning.


It says it can represent all programs in P, i.e. all reasoning that produces an output in no more than polynomial time with respect to the amount of inputs. Most human reasoning is presumably in P, not NP, because we generally don't find ourselves needing to think exponentially long about things.


who's "we"?


I am gonna read this paper and the other latent sentence one later today. I have always advocated that this kind of solution, together with latent sentence search, should get us to the next level of AI. Amazing work from Meta.


Sentence thing being this one? https://ai.meta.com/research/publications/large-concept-mode...

I don’t get it, isn’t this concept modelling exactly what’s going on in the deeper layers of current LLMs?


Perhaps it does some similar grouping of content, but this more directly incentivizes longer-term grouping of tokens into abstract concepts. I agree that it's not obvious this would perform better than letting the model build its own structures for grouping tokens, but the proof is in the pudding; the technique led to improved results for a given model & training size. This newer approach gives the model the freedom to build its own breakpoints, but still bakes the idea into the algorithm itself.

What it means is a harder question. Perhaps transformers are simply an inefficient computational structure for this process? Perhaps a more flexible computational structure would integrate this step more efficiently? Perhaps Transformers are efficient enough, but our learning/densifying isn't? Or perhaps it's such a core powerful step that it might as well be built into the algo regardless? Much to learn.



