Ohh, don't mind if I do! I'm working on CPU libraries to improve compression performance. Does zstd benefit today from ISAs like AVX-512, AVX-2, etc.? Do you foresee the need in the future to offload compression/decompression to an accelerator?
Another Q, what hardware platform(s) do you use to test new versions of zstd for regressions/improvements? I see the GitHub page shows a consumer CPU in use, but what about server CPUs pushing maximum throughput with many threads/cores?
> Does zstd benefit today from ISAs like AVX-512, AVX-2, etc.?
Zstd benefits mostly from BMI(2): it takes advantage of shlx, shrx, and bsf during entropy (de)coding. We do use SSE2 instructions in our lazy match finder to filter candidate matches by a 1-byte hash tag, in a similar way to F14 and Swiss hash tables. We also use vector loads/stores to do our copies during decompression.
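For the curious, here's a minimal sketch of the 1-byte tag filtering idea (my own illustration with SSE2 intrinsics, not zstd's actual code): load 16 candidate tags, compare them all against the query's tag at once, and get back a bitmask of slots worth a full match check.

    #include <emmintrin.h>  /* SSE2 */
    #include <stdint.h>

    /* Hypothetical sketch of 1-byte tag filtering, the trick F14 and Swiss
     * hash tables use: compare 16 candidate tags against the query tag in
     * parallel; only slots whose bit is set need full verification. */
    static unsigned matching_slots(const uint8_t tags[16], uint8_t queryTag)
    {
        const __m128i group = _mm_loadu_si128((const __m128i *)tags);
        const __m128i eq    = _mm_cmpeq_epi8(group, _mm_set1_epi8((char)queryTag));
        return (unsigned)_mm_movemask_epi8(eq);
    }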
We don't currently benefit from AVX2. There are strategies to use AVX2 for Huffman & FSE compression, but they don't fit well with zstd's format, so we don't use them there. Additionally, our latest Huffman decoder is very competitive with the AVX2 decoder on real data. And a Huffman format that is fast for AVX2 decoding is often slow for scalar decoding, so it is hard to opt into it unless you know you will only run on CPUs that support AVX2. Lastly, AVX2 entropy decoders rely on a fast gather instruction, and gather hasn't been fast on AMD machines.
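To illustrate the gather point, a hedged sketch (again mine, not zstd code): a vectorized entropy decoder keeps several decoder states in flight and wants to look up their decode-table entries in parallel, which on AVX2 means vpgatherdd. If that one instruction is slow, the whole approach loses to a good scalar decoder.

    #include <immintrin.h>  /* AVX2 */
    #include <stdint.h>

    /* Hypothetical sketch: gather 8 decode-table entries at once, one per
     * interleaved decoder state. This is the instruction whose speed makes
     * or breaks an AVX2-style entropy decoder. */
    static __m256i gather8_entries(const uint32_t *table, __m256i states)
    {
        /* table[states[i]] for i = 0..7, scale = 4 bytes per entry */
        return _mm256_i32gather_epi32((const int *)table, states, 4);
    }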
> Do you foresee the need in the future to offload compression/decompression to an accelerator?
Yes, there are trends in this direction. Intel QAT supports zlib hardware acceleration. The PS5 has accelerated decompression. I'm pretty sure Microsoft has also released a paper about a hardware-accelerated algorithm. Compression & decompression take up a significant portion of datacenter CPU time, so it makes sense to accelerate them in hardware.
> what hardware platform(s) do you use to test new versions of zstd for regressions/improvements?
We mostly test and optimize for x86-64 CPUs, on a mix of server and consumer hardware. But we also test on ARM devices to make sure we don't have major regressions. We've found that optimizing for x86-64 generally yields good baseline performance on other architectures, though there is certainly room left to improve.