With current improvements to training performance and parallelism (e.g. DeepSpeed: https://www.deepspeed.ai ), it wouldn't surprise me if training GPT-2 small from scratch on a couple of 3080s becomes possible in a matter of days, with GPT-2 XL not taking 10x longer.
I agree. I've been training on 2x3090s connected via NVLink and they're really fast for training language models. I'm actually tempted to try to replicate the OP's GPT-2 run using Huggingface, DeepSpeed, and OpenWebText, but the GPUs are occupied right now training a GPT-2 774M C language model...
Linux (Ubuntu 20.04) + CUDA 11.2. For the backend I use PyTorch; TensorFlow has some nice optimizations (like XLA, which uses LLVM to JIT optimized code for the GPU), but I found it very painful to get working reliably, and most of the language modeling stuff I've seen uses PyTorch.
For the language model training itself I've been experimenting with a few different things. I started off with Huggingface because it's very easy to get up and running, and I still use its tokenizers library to do BPE training on the C source dataset (though there are still some hitches there – other libraries expect slightly different formats for the tokenizer model, like using different ways to represent the <|endoftext|> marker).
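For reference, training a GPT-2-style byte-level BPE tokenizer with the tokenizers library looks roughly like this. This is just a minimal sketch – the file paths and hyperparameters are placeholders, not the ones I actually used – but it shows where <|endoftext|> gets registered:

```python
from tokenizers import ByteLevelBPETokenizer

# Train a GPT-2-style byte-level BPE tokenizer on the raw C source corpus.
# (Paths, vocab size, and min_frequency here are illustrative placeholders.)
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["data/c_corpus_train.txt"],
    vocab_size=50257,                     # GPT-2's vocab size
    min_frequency=2,
    special_tokens=["<|endoftext|>"],     # document separator, GPT-2 style
)

# Writes vocab.json and merges.txt, which is what Huggingface's GPT2Tokenizer
# expects; other frameworks may want these files repackaged differently.
tokenizer.save_model("tokenizer/c-bpe")
```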
After prototyping the C language model training at home, I tried moving the training up to NYU's HPC cluster, which has a bunch of 4xV100 and 4xRTX8000 nodes (mainly because the sound of two powerful GPU fans running at 100% gets a bit old after a while). Unfortunately, I discovered that with larger models the GPU-GPU communication overhead can be prohibitive: most of the cluster nodes only support P2P GPU communication over PCIe, which is a lot slower than NVLink, and Huggingface's implementation actually performed worse on the cluster's multi-GPU nodes than on two 3090s with NVLink (I opened an issue to track it here: https://github.com/huggingface/transformers/issues/9371 ).
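If you want to sanity-check what kind of GPU-to-GPU path a machine actually gives you, a rough PyTorch sketch like this (assuming the two GPUs are devices 0 and 1) will tell you whether P2P access is available and roughly how fast a device-to-device copy is:

```python
import time
import torch

# Can the two GPUs talk to each other directly (NVLink or PCIe P2P)?
print("P2P 0->1 available:", torch.cuda.can_device_access_peer(0, 1))

# Rough bandwidth test: copy a ~1 GiB tensor from GPU 0 to GPU 1 a few times.
x = torch.empty(256 * 1024 * 1024, dtype=torch.float32, device="cuda:0")
torch.cuda.synchronize(0)
torch.cuda.synchronize(1)
start = time.time()
for _ in range(10):
    y = x.to("cuda:1", non_blocking=True)
torch.cuda.synchronize(0)
torch.cuda.synchronize(1)
elapsed = time.time() - start
print(f"~{10 * x.numel() * 4 / elapsed / 1e9:.1f} GB/s GPU0 -> GPU1")
```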
Currently I'm working on getting DeepSpeed running so that I can hopefully get better scaling even in the absence of a fast GPU-GPU interconnect. This is again a little bit annoying, because it seems like every framework wants a slightly different way of representing the tokenizer and training data – I've had to preprocess the dataset in about 4 different ways (plain text, loose JSON, npy (for DeepSpeed), and a custom indexed binary format for Megatron-LM). I'm also hoping to try out Huggingface's recently-released DeepSpeed integration, which (if it works) would be a really nice combination of usability and performance: https://huggingface.co/blog/zero-deepspeed-fairscale
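As an example of the kind of trivial-but-annoying reshuffling involved, converting a plain-text dump (one document per line, say) into the loose-JSON/JSONL format looks roughly like this; the file names and the "text" field name are just placeholders for illustration, and whatever loader you're feeding may expect different keys:

```python
import json

# Convert a plain-text corpus (one document per line) into "loose JSON" /
# JSONL: one JSON object per line with a single "text" field.
with open("corpus.txt", encoding="utf-8") as src, \
     open("corpus.jsonl", "w", encoding="utf-8") as dst:
    for line in src:
        doc = line.rstrip("\n")
        if doc:
            dst.write(json.dumps({"text": doc}) + "\n")
```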
As for other software stack hitches: so, so many. The main one is just managing the different versions of CUDA. The 3090 is only supported starting with CUDA 11.1, but many packages and frameworks only support 11.0 at best. And some of the newer things like DeepSpeed use PyTorch extensions, which require you to have the exact version of CUDA around that was used to build PyTorch. So I've had to do a fair bit of compiling packages from source rather than relying on prebuilt packages.
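A quick way to catch the mismatch before an extension build blows up is to compare the CUDA version PyTorch was built against with whatever nvcc is first on your PATH. A small sanity-check sketch, nothing official:

```python
import shutil
import subprocess
import torch

# CUDA version PyTorch was compiled against, e.g. "11.1".
print("torch built with CUDA:", torch.version.cuda)

# CUDA toolkit that JIT-compiled extensions (e.g. DeepSpeed's fused ops)
# will actually pick up: whichever nvcc is first on PATH.
nvcc = shutil.which("nvcc")
if nvcc is None:
    print("no nvcc on PATH -- extension builds will fail")
else:
    print("nvcc:", nvcc)
    print(subprocess.run([nvcc, "--version"], capture_output=True, text=True).stdout)
```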
The path of least resistance here is probably to use the NVIDIA NGC containers, but it took NVIDIA more than a month to get them updated after the 3090 was released, and I find working inside containers for everything inconvenient anyway (I hate losing my bash history, and I always accidentally end up losing data or local changes when I exit a container).
Anyway, this ended up being a bit more rambling than I intended, but it was helpful to write it all down and maybe it'll help someone else avoid some stumbling blocks :)
I'm using a 2080 Ti for my "at home" projects, and going above CUDA 11.0 does indeed break lots of things - good luck trying to make something like a Colab/Kaggle Docker image for DS where you have a TF+torch+sklearn+more combo.
Would you mind sharing the resources you used to assemble your software/driver stack for the 2x3090 setup?
Sure – I mostly just installed the driver and CUDA toolkit from NVIDIA's apt repository for Ubuntu. This is nice because staying updated can be done through the usual apt commands, and you can have multiple CUDA versions installed side by side by installing cuda-11-0, cuda-11-1, etc. Switching between them seems to be as easy as making sure the appropriate /usr/local/cuda-<version>/bin dir is first in your PATH.
In general the NVIDIA documentation is pretty good, and I try to follow their Ubuntu-specific instructions whenever possible.
Thanks for all the detail! A lot to chew on here. I'm using just one 3090, but getting it stood up is taking some doing. Getting close now with NVIDIA's Docker container, but they have conflicting instructions on different pages, so it's "fun."
I'm pretty new to containers, so I'm glad you alerted me to those issues. Not sure which way I'll go in the end, but I'm on Ubuntu and trying to get StyleGAN2 up… it was working great with a 1080 Ti.
For large models it does help! The training loop for multiple GPUs with data parallelism is roughly:
1. Split the data up
2. Do a forward and backward pass on each GPU individually
3. Compute the average of the gradients and update the model on each GPU
4. Repeat
For step 3 you need to send the gradients from each GPU somewhere, and then send back either the averaged gradients or the updated model weights. So when the model is large (say, ~3 GB of fp32 gradients for GPT-2 774M – 774M parameters × 4 bytes) that's a lot of GPU-GPU communication!
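To make step 3 concrete, a stripped-down version of what the frameworks do under the hood looks something like this – a sketch using torch.distributed directly, with one process per GPU and the process-group setup assumed to have happened already. Real implementations (DistributedDataParallel, DeepSpeed) do the same thing but bucket the gradients and overlap the all-reduce with the backward pass:

```python
import torch
import torch.distributed as dist

def train_step(model, optimizer, batch, loss_fn):
    # 1./2. Each process (one per GPU) runs its own forward/backward pass
    # on its shard of the batch.
    optimizer.zero_grad()
    loss = loss_fn(model(batch["input"]), batch["target"])
    loss.backward()

    # 3. Average the gradients across all GPUs. Every gradient tensor gets
    # shipped over the interconnect, which is why a multi-GB model hurts so
    # much without NVLink.
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size

    # Every replica applies the same averaged update, so the weights stay
    # in sync without ever sending the full model around.
    optimizer.step()
    return loss.item()
```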
You're right that for the vast majority of ML cases, the models are small enough that the synchronization cost is negligible, though.
Maybe so, but the largest one, with 1.5B parameters, would very likely take months to train on a single GPU. I tried fine-tuning it with a 256-core slice of a TPUv2 pod, which is huge, and it still took a few days.