Replicating GPT-2 at Home (bkkaggle.github.io)
254 points by bkkaggle on Jan 23, 2021 | 46 comments



As someone who maintains a package that makes it easy to both fine-tune GPT-2 and create your own model from scratch (https://github.com/minimaxir/aitextgen), this submission is a good run-through of the technical considerations involved in building a GPT-2 model.

It's both substantially easier and faster than it was when OpenAI released their paper in 2019, thanks to Huggingface Transformers and Tokenizers making the architectures more efficient, and to other companies streamlining the training process and making it more efficient across the whole pipeline.

You don't need a TPU cluster to train a working GPT-2 model, although it helps (unfortunately, TPU support in PyTorch-based training like aitextgen is more fussy). A free GPU on Colab gets you most of the way, especially since you can now get a T4 or a V100, which lets you use FP16.
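As a rough sketch of what that looks like with Huggingface Transformers on a single Colab GPU (the file names and hyperparameters below are illustrative, not any particular project's setup):

    from transformers import (DataCollatorForLanguageModeling, GPT2LMHeadModel,
                              GPT2TokenizerFast, TextDataset, Trainer,
                              TrainingArguments)

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")  # the 124M "small" checkpoint

    # Chunk a plain-text corpus into fixed-length training examples.
    train_dataset = TextDataset(tokenizer=tokenizer, file_path="corpus.txt", block_size=512)
    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    args = TrainingArguments(
        output_dir="gpt2-finetuned",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,  # simulate a larger batch on one GPU
        num_train_epochs=1,
        fp16=True,                      # needs a T4/V100-class GPU
        save_steps=500,
    )

    Trainer(model=model, args=args, train_dataset=train_dataset,
            data_collator=collator).train()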


Yep, I started off trying to get it to work with PyTorch (https://github.com/bkkaggle/lm-training-research-project/blo...), then with PyTorch Lightning, but the whole one-user-VM-per-TPU-board limitation in pytorch-xla 7-8 months ago made me switch over to TF


Just as Google wants you to do. Within 3-5 years you will probably see a steep price increase and nowhere else to go.


heh. I've been using jax for a couple of months and it's been a pretty nice replacement for both pt and tf. it feels like what an ML framework would look like if it were built around easy scaling and dev friendliness.


> You don't need a TPU cluster to train a working GPT-2 model [...] A free GPU on Colab gets you most of the way

I have a hard time believing you can really train it with one V100, unless you are talking about an extremely scaled-down version of GPT-2 (large).

If you can train it at all, it would be with a batch size so small (probably 1?) that it would hurt performance, and it would take months.

Am I out of the loop somehow?

Edit: I was thinking about reproducing the training that OpenAI did in their paper, so redoing all the pre-training, but I realized you might have been talking about training on a smaller custom dataset.


also, he might just be talking about training a much smaller model than the 1.5B one, because otherwise that would maybe take years


What do you think would be necessary to generate rhyming text with a particular phrasing / rhythm?

e.g. in the style of a particular rapper?

If you just fine-tune on a corpus of their lyrics, you might miss the underlying poetic constraints.

If there were an additional prior (a "poetry / assonance / rhyme" model), what is the easiest way to constrain generation to respect this prior?

Thanks!


I wrote "Stylistic Rhyme-bound Poetry Generation or: How You Too Can Generate Sonnets in the Style of Kanye West" [1] back in 2017 for an easy DIY introduction to this topic. You specify the rhyming scheme (ABAB CDCD etc) and it forces end-line rhymes around it.

It uses Markov chains instead of GPT-2, but the approach should work with prompt-based things like GPT-2 also: for lines that are "free" (e.g. no specific word you need to rhyme with), you can generate the line normally -- but for lines you need to rhyme with a specific word, you can just generate last-word-first and generate backwards. For a strictly LTR prompt like GPT-2, you could probably just reverse your corpus word order, generate "reverse" lines with GPT-2 given the previous line + word you need to rhyme with as the prompt, and then reverse it back to "normal" in postprocessing.

[1] https://festivalpeak.com/stylistic-rhyme-bound-poetry-genera...

Some examples of the output of this approach:

[2] https://medium.com/words-of-mimicry/kanye-west-ballade-1-a6f...

[3] https://medium.com/words-of-mimicry/me-you-and-slow-sip-slow...

I'd expect the output to be better with something like GPT-2/3, since Markov chains are so twentieth-century, but I was pretty happy with the output quality even though it often rhymed the same word repeatedly; you could improve that by down-weighting previously used words, removing them from the pool of rhyming words, and/or backtracking to previous lines when you find yourself without other words to rhyme.
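For the GPT-2 variant, here's a rough sketch of the reverse-order generation described above, assuming a model fine-tuned on a lyrics corpus whose lines have had their word order reversed (the model path, prompt format, and sampling settings are all illustrative):

    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")  # swap in your reversed-corpus fine-tune

    def generate_rhyming_line(previous_line, rhyme_word, max_new_tokens=16):
        # Prompt in reversed word order: the previous line flipped, then the word
        # the new line has to end on, so the model "starts" from the rhyme.
        prompt = " ".join(reversed(previous_line.split())) + "\n" + rhyme_word
        input_ids = tokenizer.encode(prompt, return_tensors="pt")
        output_ids = model.generate(
            input_ids,
            max_length=input_ids.shape[1] + max_new_tokens,
            do_sample=True,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
        )
        continuation = tokenizer.decode(output_ids[0][input_ids.shape[1]:],
                                        skip_special_tokens=True)
        # Keep only the first generated line, then flip the word order back.
        reversed_line = (rhyme_word + " " + continuation).split("\n")[0]
        return " ".join(reversed(reversed_line.split()))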


A paper was recently released for that particular use case (https://github.com/markriedl/weirdai), which describes a number of technical caveats (and it's technically not using GPT-2).

I do think it's possible to train a GPT-2-esque network to do something similar, albeit with some text encoding shenanigans.


As far as I know, to get a V100 you need Colab Pro? Did this change recently?


It's unclear. I've heard of people getting the V100 without Colab Pro, although I do use Colab Pro and get a V100 almost every time.

As an aside, if you do get a V100, Colab Pro is by far the cheapest way to train an AI model. ($10/mo is much, much cheaper than $2.48+/hr on GCP normally!) Although you need to sync checkpoints to external storage in case the notebook dies.
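A minimal sketch of that checkpoint syncing on Colab, copying everything to a mounted Google Drive folder (the paths are illustrative):

    import shutil
    from pathlib import Path
    from google.colab import drive

    drive.mount("/content/drive")
    REMOTE = Path("/content/drive/MyDrive/gpt2-checkpoints")
    REMOTE.mkdir(parents=True, exist_ok=True)

    def sync_checkpoints(local_dir="gpt2-finetuned"):
        # Copy every checkpoint file up to Drive so a dead runtime doesn't lose the run.
        for f in Path(local_dir).rglob("*"):
            if f.is_file():
                dest = REMOTE / f.relative_to(local_dir)
                dest.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(f, dest)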


> As an aside, if you do get a V100, Colab Pro is by far the cheapest way to train an AI model.

But others should be aware that you get what you pay for. Google still rate limited me when I used Colab Pro, and I ran into a myriad of other small problems. If that's all one is willing to spend to play with AI, 100% go for it. It's a great place to start. But if you're at all serious and can afford it, I think a local machine with a modest GPU is worth every penny.


Curious: is it better to train locally on something like a 2080 Ti 11GB, or go for Colab and offload checkpoints to S3?

Asking because it seems V100 performance (or the other Colab paid GPU) is worth the occasional instability if you've set up checkpoints.


Look under "FP16 16-bit (Half Precision) Floating Point Calculations" on https://www.microway.com/knowledge-center-articles/compariso...

These raw numbers don't tell the whole story, of course. But IMHO, the convenience of a local 2080 Ti outweighs the speed benefits of a _somewhat flaky_ V100 via Colab for day-to-day use (unless memory size is an issue, which you can't really get around).

OTOH, for just trying out stuff / one-offs, Colab is perfect - and bonus points if you score a V100.


Alas, only if you live in the US.

Colab Pro isn't available outside the US (without breaking Google's terms).


US and Canada.


First off -- the author has done an amazing tutorial, it's very enjoyable, so I am by no means throwing shade.

But a week of TPUv3-128 is anywhere between $10k and $20k in TPU costs alone; saying that this is an "at home" kind of experiment is cheeky at best, clickbait at worst.


Hi, I'm glad that you enjoyed it!

Yeah, I totally get your point about the title (the TPU quota that I got was close to the equivalent of $20k), but in my defense, I don't have access to any compute beyond what I get through the TFRC or through Google Colab


Yes it's an amazing tutorial. Thank you.

Speaking as a hobbyist: it used to be that, if you had enough determination, you could create just about any software by hacking at it long enough. CPU or cost was generally not an issue; your time and tenacity were.

This has now unfortunately changed, and innovation in software (especially ML) is now largely about how deep your pockets are.


I think this is quite a rose-colored view of the past. Rendering with many graphics techniques was out of reach for hobbyists for a long time, for example.


Many hobbies cost $10k-$20k. If you work in engineering, that's not far away from "at home" hobbies.

The time that went into this project was almost certainly worth more than $10k.


I imagine you’re speaking about the cost of e.g. setting up a wood shop in your garage, rather than the cost of making something in said wood shop. Training this seems more like the latter, while the comparable cost is the former.


If you train this model and then use it to do other interesting things, training big models is like setting up a wood shop.


If your hobby is building wood furniture, a wood shop helps you do that hobby into the future. It will improve your projects, and help your enjoyment of your hobby. The tools also hold some sort of residual value.

If your hobby is building AI/ML models, a one-shot trained model isn’t going to really help you on an ongoing basis. It’s an amazing single shot project, but if your hobby is actually ML then you probably aren’t going to be happy just looking at your completed trained model - you are going to want to train a bigger, better model.

And if your hobby is building software, you can just download a pre-trained model for free.

I don’t think the analogy holds the other way.


You can download a pretrained, full-size GPT-2 for $0. Training it from scratch would be merely for fun. You can fine-tune the model for far, far less ($0-$10) if you have a specific application.
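For scale, that $0 download of the full 1.5B-parameter checkpoint really is just a few lines with Huggingface Transformers (the prompt below is only an example):

    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-xl")  # the 1.5B checkpoint
    model = GPT2LMHeadModel.from_pretrained("gpt2-xl")        # roughly 6 GB of weights

    ids = tokenizer.encode("The meaning of life is", return_tensors="pt")
    print(tokenizer.decode(model.generate(ids, max_length=40, do_sample=True)[0]))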

It's not comparable to a hobby. It's comparable to paying $10k to make a sandwich.


setting up and growing a garden to make a sandwich from scratch


The author, at 17 years of age, can understand academic papers and research, and has the skills and dedication to go through an exercise of reconstructing the state of the art.

I can't help but feel pride and hope for the future, both the author's and the world's.


I was watching an ICML presentation and was surprised by the presenter's (not OP, a different AI prodigy) apparent age. Well, it turns out he was 17 and a 2nd-year PhD student. I think he graduated from UC Davis when he was 14 or something.

Some people roll wicked real life DnD character sheets, that's for sure.


and parents



How many off-the-shelf GPUs are needed to replicate GPT-2 in a year?


With current improvements to training performance and parallelism (e.g. DeepSpeed: https://www.deepspeed.ai ), it wouldn't surprise me if creating GPT-2 small from scratch becomes possible in days with a couple of 3080s, with GPT-2 XL not taking 10x longer.


I agree. I've been training on 2x 3090s connected via NVLink and they're really fast for training language models. I'm actually tempted to try to replicate the OP's GPT-2 replication using Huggingface, DeepSpeed, and OpenWebText, but the GPUs are occupied right now training a GPT-2 774M C language model...


What software stack are you using to get your 3090s working? Any hitches along the way?


Linux (Ubuntu 20.04) + CUDA 11.2. For the backend I use PyTorch; TensorFlow has some nice optimizations (like XLA, which uses LLVM to JIT optimized code for the GPU), but I found it very painful to get working reliably, and most of the language modeling stuff I've seen uses PyTorch.

For the language model training itself I've been experimenting with a few different things. I started off with Huggingface because it's very easy to get up and running, and I still use its tokenizers library to do BPE training on the C source dataset (though there are still some hitches there – other libraries expect slightly different formats for the tokenizer model, like using different ways to represent the <|endoftext|> marker).
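That BPE training step is only a few lines with the tokenizers library; a rough sketch (the file paths and settings here are illustrative, not necessarily the exact setup described above):

    from tokenizers import ByteLevelBPETokenizer

    # Train a byte-level BPE vocabulary on the raw C source corpus.
    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train(
        files=["c_corpus.txt"],
        vocab_size=50257,                   # GPT-2's vocabulary size
        min_frequency=2,
        special_tokens=["<|endoftext|>"],   # the marker other tools may represent differently
    )
    tokenizer.save_model("c-tokenizer")     # writes vocab.json + merges.txt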

After prototyping the C language model training at home, I tried moving the training up to NYU's HPC cluster, which has a bunch of 4xV100 and 4xRTX8000 nodes (mainly because the sound of two powerful GPU fans running at 100% gets a bit old after a while). Unfortunately I discovered that with larger models the GPU-GPU communication overhead can be prohibitive (most of the cluster nodes only support P2P GPU communication over PCIe, which is a lot slower than NVLink), and Huggingface's implementation actually performed worse on multiple GPUs than on two 3090s with NVLink (I opened an issue to track it here: https://github.com/huggingface/transformers/issues/9371 ).

Currently I'm working on getting DeepSpeed running so that I can hopefully get better scaling even in the absence of a fast GPU-GPU interconnect. This is again a little bit annoying, because it seems like every framework wants a slightly different way of representing the tokenizer and training data – I've had to preprocess the dataset in about 4 different ways (plain text, loose JSON, npy (for DeepSpeed), and a custom indexed binary format for Megatron-LM). I'm also hoping to try out Huggingface's recently-released DeepSpeed integration, which (if it works) would be a really nice combination of usability and performance: https://huggingface.co/blog/zero-deepspeed-fairscale
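For anyone curious what the Trainer + DeepSpeed wiring looks like, a rough sketch (the config values are illustrative, not a tuned setup):

    import json

    # Minimal DeepSpeed config: fp16 plus ZeRO stage 2, which shards optimizer
    # state and gradients across GPUs to cut per-GPU memory use.
    ds_config = {
        "train_micro_batch_size_per_gpu": 2,
        "gradient_accumulation_steps": 8,
        "fp16": {"enabled": True},
        "zero_optimization": {"stage": 2},
    }
    with open("ds_config.json", "w") as f:
        json.dump(ds_config, f, indent=2)

    # In the training script, point TrainingArguments at the config, e.g.
    #   args = TrainingArguments(..., fp16=True, deepspeed="ds_config.json")
    # and launch the script with the `deepspeed` launcher instead of plain `python`.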

As for other software stack hitches: so, so many. The main one is just managing the different versions of CUDA. The 3090 is only supported starting with CUDA 11.1, but many packages and frameworks only support 11.0 at best. And some of the newer things like DeepSpeed use PyTorch extensions, which require you to have the exact version of CUDA around that was used to build PyTorch. So I've had to do a fair bit of compiling packages from source rather than relying on prebuilt packages.

The path of least resistance here is probably to use the NVIDIA NGC containers, but it took NVIDIA more than a month to get them updated after the 3090 was released, and I find working inside containers for everything inconvenient anyway (I hate losing my bash history, and I always accidentally end up losing data or local changes when I exit a container).

Anyway, this ended up being a bit more rambling than I intended, but it was helpful to write it all down and maybe it'll help someone else avoid some stumbling blocks :)


Thanks for sharing.

I'm using a 2080 Ti for my "at home" projects, and going above CUDA 11.0 does indeed break lots of things - good luck trying to make something like a Colab/Kaggle Docker image for data science where you have a TF + torch + sklearn + more combo.

Would you mind sharing the resources you used to assemble your software/driver stack for the 2x 3090 setup?


I went with the NVIDIA Ubuntu repos:

https://docs.nvidia.com/cuda/cuda-installation-guide-linux/i...

This is nice because staying updated can be done through the usual apt commands, and you can have multiple CUDA versions installed by installing cuda-11-0, cuda-11-1, etc. Switching between them seems to be as easy as just making sure the appropriate /usr/local/cuda-<version>/bin dir is first in your PATH.

In general the NVIDIA documentation is pretty good, and I try to follow their Ubuntu-specific instructions whenever possible.


Thanks for all the detail! A lot to chew on here. I'm using just one 3090, but getting it stood up is taking some doing. Getting close now with NVIDIA's Docker container, but they have conflicting instructions on different pages, so it's "fun."

Pretty new to containers, so I'm glad you alerted me to those issues. Not sure which way I will go in the end, but I'm on Ubuntu and trying to get StyleGAN2 up… it was working great with a 1080 Ti.


Does NVLink actually help? It's mostly useful for transferring data between GPUs, so I assume you're using pipeline parallelism or similar?


For large models it does help! The training loop for multiple GPUs with data parallelism is roughly:

1. Split the data up

2. Do a forward and backward pass on each GPU individually

3. Compute the average of the gradients and update the model on each GPU

4. Repeat

For step 3 you need to send the gradients from each GPU somewhere, and then send back either the averaged gradients or the updated model weights. So when the model is large (say, 3 GB for GPT-2 774M!) that's a lot of GPU-GPU communication!
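Concretely, step 3 boils down to an all-reduce over every gradient tensor; a bare-bones sketch with torch.distributed (real implementations like DistributedDataParallel bucket the tensors and overlap this with the backward pass):

    import torch.distributed as dist

    def average_gradients(model, world_size):
        # Sum each gradient across all GPUs, then divide by the number of GPUs
        # to get the mean gradient that every replica applies.
        for param in model.parameters():
            if param.grad is not None:
                dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
                param.grad.data /= world_size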

You're right that for the vast majority of ML cases, the models are small enough that the synchronization cost is negligible, though.

I wrote up some benchmarks here:

https://github.com/huggingface/transformers/issues/9371


maybe so, but the largest one, the 1.5B-parameter model, would very likely take months to train on a single GPU. I've tried to fine-tune it with a TPUv2-256 slice, which is huge, and it took a few days


At home, in the cloud, for tens of thousands of $$$.


"Mom, can I have a GPT-2?"

"No, we have GPT-2 at home."

GPT-2 at home: [Outputs this comment]


UWaterloo has such precocious students


TL;DR:

> Unfortunately, ALGPT-2 doesn’t perform as well as GPT-2 (ALGPT-2 gets 31 ppl on OpenWebText compared to 21 ppl for my pretrained GPT-2 model), but I’m writing this series of blog posts to go through everything I’ve learned over the last few months.


the way he describes the process he went through is still super helpful



