> GPUArrays never had to implement automatic differentiation explicitly to support the backward pass of the neural network efficiently. This is because Julia's automatic differentiation libraries work for arbitrary functions and emit code that can run efficiently on the GPU. This helps a lot to get Flux working on the GPU with minimal developer effort, and makes Flux GPU support work efficiently even for user-defined functions. That this works out of the box, without coordination between GPUArrays and Flux, is a pretty unique property of Julia.
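To make that concrete, here's a rough sketch of what it looks like from the user side (assuming a recent Flux + CUDA.jl setup; `myactivation` and `loss` are made-up user functions, not anything from the article):

    using Flux, CUDA

    # A user-defined scalar function -- no GPU-specific code anywhere.
    myactivation(x) = x * tanh(log(1 + exp(x)))

    model = Chain(Dense(10 => 32, myactivation), Dense(32 => 1)) |> gpu
    x, y  = cu(randn(Float32, 10, 64)), cu(randn(Float32, 1, 64))

    # The AD system differentiates through myactivation and runs both the
    # forward and backward pass on the device.
    loss(m) = sum(abs2, m(x) .- y)
    grads = Flux.gradient(loss, model)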
Every time I read about Julia, I’m amazed. What a game changing tool.
GPGPU (general-purpose GPU) programming is pretty cool. I wrote a utility to let you do it in JavaScript, in the browser, a while back: https://github.com/amoffat/gpgpu.js
The thing to note about GPU programming is that the vast majority of the overhead comes from data transfer. If your data set and results are very large, it is sometimes net faster to do the computation on the CPU, even if the GPU performs each calculation faster on average due to parallelism. To illustrate, look at the benchmarks on gpgpu.js running a simple kernel:
The theoretical speedup excludes data transfer while actual speedup includes it. The longer you can keep your data set on the GPU to do more calculations (avoiding back and forth IO), the bigger your net speed gains are.
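To make the point concrete, here's a rough Julia/CUDA.jl sketch of the difference (array sizes made up; `cu` copies to the device, `Array` copies back to the host):

    using CUDA

    a = rand(Float32, 4096, 4096)

    # Round-trips to the device for every step: transfer cost dominates.
    b = Array(cu(a) .+ 1f0)
    c = Array(cu(b) .* 2f0)

    # Upload once, chain the work on the GPU, download once.
    d_a = cu(a)
    d_c = (d_a .+ 1f0) .* 2f0     # intermediates stay on the device
    c   = Array(d_c)              # single transfer back to the host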
Note that `gl.finish` is often implemented as non-blocking in WebGL, so measuring query submission time on the host (cpu) does not necessarily reflect execution time on the device (gpu). It is more accurate to use timer queries. For more info see: https://stackoverflow.com/questions/20798294/is-it-possible-...
I remember OpenCL for Firefox being discussed and discarded in favor of compute shaders about 7 years ago, and then when WebGL 2.0 was finally released 5 years later, compute shaders were not part of it.
Additionally, SIMD.js was developed and then killed again, and support for SIMD in WebAssembly has been delayed, so I don't believe that we'll be able to implement efficient numerical methods in the browser any time soon.
Note that you can do GPGPU with the current WebGL; compute shaders don't have that many additional capabilities unless you are looking to do computation without interference from onscreen WebGL graphics.
I wonder how they are going to pull that off without security implications.
If you've played with compute shaders (or any of the modern "general purpose" shader stuff, i.e. arbitrary loads & stores, etc.) you probably know that it's quite easy to crash drivers. Of course you generally do so by provoking some form of UB (although not always; their runtimes/compilers are far from bug-free).
But WebGL can't have that, so I don't see how they could pull that off without adding a ton of runtime bounds checks to shaders, like they do for index buffers right now, but on the GPU side this time.
Not only would that be bad for performance, but I still would never trust this whole stack to run arbitrary code.
Compute shaders are no different in this regard from other types of shaders that are currently supported by WebGL.
They have the same capabilities and data types as other shaders. The difference is when they are invoked and the lack of implicit inputs/outputs. This is needed to perform computation out of lockstep with the graphics pipeline when you are also rendering graphics.
The existing WebGL 1 & WebGL 2 shader compilation process does involve "a ton of checks": shaders are recompiled to have known-safe memory accesses, eliminate uninitialized memory/variables, etc.
In compiler technology, bounds checking usually does not involve a big cost: static sizes are checked at zero runtime cost, and most dynamic checks can be hoisted out of loops.
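As a hand-written Julia illustration of the hoisting idea: check the sizes once up front, and the per-iteration checks inside the loop can be dropped.

    # One explicit check before the loop stands in for every per-element
    # check the compiler would otherwise have to emit.
    function axpy!(y::Vector{Float32}, a::Float32, x::Vector{Float32})
        length(y) == length(x) || throw(DimensionMismatch("lengths differ"))
        @inbounds for i in eachindex(x)
            y[i] += a * x[i]
        end
        return y
    end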
I'm a bit surprised to see that GPU Mandelbrot is only at best x75 faster than (sequential?) CPU. Does Julia just generate really fast (multicore/vectorized?) CPU code? Does it also count communication costs? Fractal computations like that are extremely GPU friendly because they involve no memory accesses at all, except for writing the final result. I would expect at least two orders of magnitude improvement over a straightforwardly written C implementation.
I had a suspicion that the problem was that maxiters was set fairly low (16). For the Mandelbrot sets I'm used to, this would result in relatively little computation (over a hundred iterations is more common). To investigate this hunch, I wrote a Futhark program as a proxy for a hand-written GPU implementation (didn't quite have the patience for that this evening):
import "lib/github.com/diku-dk/complex/complex"
module c32 = mk_complex f32
type c32 = c32.complex
let juliaset (maxiter: i32) (z0: c32) =
let c = c32.mk (-0.5) 0.75
let (_, i) = loop (z, i) = (z0, 0) while i < maxiter && c32.mag z <= 4 do
(c32.(z * z + c), i+1)
in i
let main (n: i32) (maxiter: i32) =
let N = 2**n
let (w, h) = (N, N)
let q = tabulate_2d w h (\i j -> let i = 1 - r32 i*(2/r32 w)
let j = -1.5 + r32 j*(3/r32 h)
in c32.mk i j)
in map (map (juliaset maxiter)) q
I don't have the tools or skills to easily generate nice graphs, but I get about a x271 speedup when this code is compiled to OpenCL and run on a Vega 64 GPU, versus when it is compiled to C. The C code runs in 272ms, which is very close to the number reported for Julia here[0] (I assume that the page has a typo and that the N=2²⁴ column actually means N=2¹², because an N=2²⁴ image would take up dozens of TiB in GPU memory. Also, the benchmark code linked only goes up to N=2¹².)
Unfortunately, if I change maxiters to 256, the speedup actually drops, to x182. So much for that theory.
Edit: also tried on an NVIDIA GTX 780Ti, and the results are the same. My new theory is that the Julia code also counts the cost of moving the resulting array back to the host (which can easily be dominant for a kernel that runs for just a millisecond or two).
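A rough way to check that, using the current CUDA.jl API rather than the article's code (`escape_time` is just a made-up stand-in for the benchmark's per-pixel function):

    using CUDA

    # Hypothetical stand-in for the per-pixel escape-time computation.
    function escape_time(z0, maxiter)
        z, i = z0, 0
        while i < maxiter && abs2(z) <= 4f0
            z = z * z + ComplexF32(-0.5f0, 0.75f0)
            i += 1
        end
        return i
    end

    d_q = cu(rand(ComplexF32, 2^12, 2^12))
    escape_time.(d_q, 16)                                  # warm-up / compilation

    t_kernel = @elapsed CUDA.@sync escape_time.(d_q, 16)   # device execution only
    t_total  = @elapsed Array(escape_time.(d_q, 16))       # execution + copy to host

If t_total is dominated by the copy, that would explain the gap.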
While having a Torch-esque GPU ndarray is great, the ability to easily write your own kernels without having to compile gnarly C++ code is what sets Julia apart from competitors IMO. Not sure if there's any other dynamic language offering anything like this.
You can of course create Clojure/lisp code that would generate kernel code, but I am strongly against it as the default option, since I find plain C much better suited to small kernels close to hardware (which are only a small proportion of the overall code) than either lisp or Julia.
Please note that there is no need for an out-of-process compilation step. You compile this by calling a Clojure function, interactively, from the REPL.
Sure, but that's a completely separate thing from what Julia is doing here. The whole point of the GPU backend is that you get almost the full flexibility of the Julia language at your disposal, without having to fall back to C code. As a result, not only is your code more familiar to Julia programmers, but you also get to share large amounts of code between CPU/GPU/TPU. If all you're doing is super small kernels that may not matter much, but GPUs are so powerful these days that it's perfectly sensible to offload rather large programs that you may not want to write in C anymore.
It's not just easier to read for Julia programmers. Any pythonista can mostly read Julia without significant issue. They're both high level and syntactically similar enough. I know Python is a lot more OO, but when you're just writing a short script it looks a lot like Julia if Julia didn't need you to specify types.
Also, that exact C example can be written almost verbatim in Julia, so C isn't "better suited" even for these small kernels; you can drop to that low level with even clearer syntax in Julia.
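For anyone curious what that looks like, here's a rough CUDAnative-style saxpy kernel (my own sketch, not from the article; the exact launch syntax varies a bit between versions):

    using CUDAnative, CuArrays

    # The kernel body reads almost line-for-line like the CUDA C version.
    function saxpy!(y, a, x)
        i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        if i <= length(y)
            @inbounds y[i] += a * x[i]
        end
        return nothing
    end

    d_x = CuArray(rand(Float32, 1024))
    d_y = CuArray(zeros(Float32, 1024))
    @cuda threads=256 blocks=4 saxpy!(d_y, 2f0, d_x)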
I find Fortran pretty pleasant for scientific work (what it was designed for) and take pleasure in knowing it hasn't changed a whole lot over the years: F77, F90, etc.
It seems kinda weird to tout how great it is that we have CuArrays and CLArrays when CLArrays hasn't been updated for 1.0 and only claims experimental support for 0.6.
Really hoping we see some movement on CLArrays in the near future.
I want to see it updated too. It was really nice to write/test/debug GPU codes on my cheap laptop without an NVIDIA GPU, and then switch over to my desktop and CUDA with single line changes. Even though CuArrays tends to be quite a bit faster, this convenience cannot be overstated. I didn't realize how much I'd miss it until I was at a conference and couldn't just fiddle around with some GPU integration I had to debug.
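For context, the "single line changes" are possible because the rest of the code is written against the generic array interface; a sketch of the pattern (package names as discussed in this thread):

    # Pick a backend with one line; everything below it is generic code.
    # using CLArrays; const GPUArray = CLArray   # laptop without an NVIDIA GPU
    using CuArrays;   const GPUArray = CuArray   # desktop with CUDA

    A = GPUArray(rand(Float32, 1024, 1024))
    B = A .* 2f0 .+ 1f0      # the same broadcast code runs on either backend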
I think the reason CLArrays.jl only claims experimental support on v0.6 is because it uses Transpiler.jl, which is quite limited and a direct translation. CUDAnative.jl is much more advanced and uses LLVM's PTX backend. I think Simon mentioned he wanted to do a more direct compilation route, and that would probably be the change to start calling it experimental? I don't know, I'd like to hear about the future of CLArrays as well. If we have both working strong, boy oh boy I'll be happy.
They did that to themselves by originally sticking with C and forcing the whole read-source, compile, link process at runtime, instead of the multiple-language support that CUDA offers.
OpenCL has been improved since then, but now it is too late.
If anyone wants to try out a free GPU using Google Colab/Jupyter (K80, you might run into ram allocation issues if you're not one of the lucky users who get to use the full amount) here's a quick guide to get a Julia kernel up and running: https://discourse.julialang.org/t/julia-on-google-colab-free...
It doesn't really describe the fundamental difference between a GPU and a 4000-core CPU, which is that the GPU has a shared program counter. All the cores must execute the same instruction at each cycle.
The conceptually independent threads are executed on 32-wide SIMD units. "All the cores" within a warp/wavefront must execute the same instruction each cycle.
GPU threads got even more independent in NVIDIA's Volta architecture; search for Volta ITS (independent thread scheduling) for details.
It's true that the programming model allows that, but underneath, all threads within a warp will execute the same instructions. However, if there is a branch, some threads can be predicated so their instructions have no effect. This is called warp divergence and can become a performance issue. If possible, branch only on the thread index using multiples of the warp size. There's a cool slide deck on implementing a parallel sum algorithm that explains this really well.
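A small hypothetical pair of kernels (CUDAnative-style Julia) showing the difference; both assume the launch configuration exactly covers `length(out)`:

    using CUDAnative

    # Divergent: neighbouring threads in the same warp take different sides
    # of the branch, so the warp executes both sides with lanes masked off.
    function divergent!(out)
        i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        if i % 2 == 0
            @inbounds out[i] = 1f0
        else
            @inbounds out[i] = 2f0
        end
        return nothing
    end

    # Uniform per warp: the condition is constant across each 32-thread warp,
    # so no warp ever executes both sides of the branch.
    function uniform!(out)
        i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        warp = (i - 1) ÷ 32
        if warp % 2 == 0
            @inbounds out[i] = 1f0
        else
            @inbounds out[i] = 2f0
        end
        return nothing
    end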