> GPUArrays never had to implement automatic differentiation explicitly to support the backward pass of the neural network efficiently. This is because Julia's automatic differentiation libraries work for arbitrary functions and emit code that can run efficiently on the GPU. This helps a lot to get Flux working on the GPU with minimal developer effort, and makes Flux GPU support work efficiently even for user-defined functions. That this works out of the box, without coordination between GPUArrays and Flux, is a pretty unique property of Julia.
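To make that concrete, here's a rough sketch of what it looks like from the user side (assuming a recent Flux + CUDA.jl setup; `myactivation` and `loss` are made-up user functions, not anything from the article):

    using Flux, CUDA

    # A user-defined scalar function -- no GPU-specific code anywhere.
    myactivation(x) = x * tanh(log(1 + exp(x)))

    model = Chain(Dense(10 => 32, myactivation), Dense(32 => 1)) |> gpu
    x, y  = cu(randn(Float32, 10, 64)), cu(randn(Float32, 1, 64))

    # The AD system differentiates through myactivation and runs both the
    # forward and backward pass on the device.
    loss(m) = sum(abs2, m(x) .- y)
    grads = Flux.gradient(loss, model)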
Every time I read about Julia, I’m amazed. What a game changing tool.
GPGPU (general-purpose GPU) programming is pretty cool. I wrote a utility to let you do it in JavaScript, in the browser, a while back: https://github.com/amoffat/gpgpu.js
The thing to note about GPU programming is that the vast majority of the overhead comes from data transfer. If your data set and results are very large, it is sometimes net faster to do the computation on the CPU, even if the GPU performs each calculation faster on average due to parallelism. To illustrate, look at the benchmarks on gpgpu.js running a simple kernel:
The theoretical speedup excludes data transfer while actual speedup includes it. The longer you can keep your data set on the GPU to do more calculations (avoiding back and forth IO), the bigger your net speed gains are.
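To make the point concrete, here's a rough Julia/CUDA.jl sketch of the difference (array sizes made up; `cu` copies to the device, `Array` copies back to the host):

    using CUDA

    a = rand(Float32, 4096, 4096)

    # Round-trips to the device for every step: transfer cost dominates.
    b = Array(cu(a) .+ 1f0)
    c = Array(cu(b) .* 2f0)

    # Upload once, chain the work on the GPU, download once.
    d_a = cu(a)
    d_c = (d_a .+ 1f0) .* 2f0     # intermediates stay on the device
    c   = Array(d_c)              # single transfer back to the host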
Note that `gl.finish` is often implemented as non-blocking in WebGL, so measuring query submission time on the host (cpu) does not necessarily reflect execution time on the device (gpu). It is more accurate to use timer queries. For more info see: https://stackoverflow.com/questions/20798294/is-it-possible-...
I remember OpenCL for Firefox being discussed and discarded in favor of compute shaders about 7 years ago, and then when WebGL 2.0 was finally released 5 years later, compute shaders were not part of it.
Additionally, SIMD.js was developed and then killed again, and support for SIMD in WebAssembly has been delayed, so I don't believe that we'll be able to implement efficient numerical methods in the browser any time soon.
Note that you can do GPGPU with the current WebGL; compute shaders don't have that many additional capabilities unless you are looking to do computation without interference from onscreen WebGL graphics.
I wonder how they are going to pull that off without security implications.
If you've played with compute shaders (or any of the modern "general purpose" shader stuff, i.e. arbitrary loads & stores, etc.) you probably know that it's quite easy to crash drivers. Of course you generally do so by provoking some form of UB (although not always; their runtimes/compilers are far from bug-free).
But WebGL can't have that, so I don't see how they could pull that off without adding a ton of runtime bounds checks to shaders, like they do for index buffers right now, but on the GPU side this time.
Not only would that be bad for performance, but I still would never trust this whole stack to run arbitrary code.
Compute shaders are no different in this regard from other types of shaders that are currently supported by WebGL.
They have the same capabilities and data types as other shaders. The difference is when they are invoked and the lack of implicit inputs/outputs. This is needed to perform computation out of lockstep with the graphics pipeline when you are also rendering graphics.
The existing WebGL 1 & WebGL 2 shader compilation process does involve "a ton of checks": shaders are recompiled to have known-safe memory accesses, eliminate uninitialized memory/variables, etc.
In compiler technology, bounds checking usually does not involve a big cost: static sizes are checked at zero runtime cost, and most dynamic checks can be hoisted out of loops.
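As a hand-written Julia illustration of the hoisting idea: check the sizes once up front, and the per-iteration checks inside the loop can be dropped.

    # One explicit check before the loop stands in for every per-element
    # check the compiler would otherwise have to emit.
    function axpy!(y::Vector{Float32}, a::Float32, x::Vector{Float32})
        length(y) == length(x) || throw(DimensionMismatch("lengths differ"))
        @inbounds for i in eachindex(x)
            y[i] += a * x[i]
        end
        return y
    end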
I'm a bit surprised to see that GPU Mandelbrot is only at best x75 faster than (sequential?) CPU. Does Julia just generate really fast (multicore/vectorized?) CPU code? Does it also count communication costs? Fractal computations like that are extremely GPU friendly because they involve no memory accesses at all, except for writing the final result. I would expect at least two orders of magnitude improvement over a straightforwardly written C implementation.
I had a suspicion that the problem was that maxiters was set fairly low (16). For the Mandelbrot sets I'm used to, this would result in relatively little computation (over a hundred iterations is more common). To investigate this hunch, I wrote a Futhark program as a proxy for a hand-written GPU implementation (didn't quite have the patience for that this evening):
import "lib/github.com/diku-dk/complex/complex"
module c32 = mk_complex f32
type c32 = c32.complex
let juliaset (maxiter: i32) (z0: c32) =
let c = c32.mk (-0.5) 0.75
let (_, i) = loop (z, i) = (z0, 0) while i < maxiter && c32.mag z <= 4 do
(c32.(z * z + c), i+1)
in i
let main (n: i32) (maxiter: i32) =
let N = 2**n
let (w, h) = (N, N)
let q = tabulate_2d w h (\i j -> let i = 1 - r32 i*(2/r32 w)
let j = -1.5 + r32 j*(3/r32 h)
in c32.mk i j)
in map (map (juliaset maxiter)) q
I don't have the tools or skills to easily generate nice graphs, but I get about a x271 speedup when this code is compiled to OpenCL and run on a Vega 64 GPU, versus when it is compiled to C. The C code runs in 272ms, which is very close to the number reported for Julia here[0] (I assume that the page has a typo and that the N=2²⁴ column actually means N=2¹², because an N=2²⁴ image would take up dozens of TiB in GPU memory. Also, the benchmark code linked only goes up to N=2¹².)
Unfortunately, if I change maxiters to 256, the speedup actually drops, to x182. So much for that theory.
Edit: also tried on an NVIDIA GTX 780Ti, and the results are the same. My new theory is that the Julia code also counts the cost of moving the resulting array back to the host (which can easily be dominant for a kernel that runs for just a millisecond or two).
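A rough way to check that, using the current CUDA.jl API rather than the article's code (`escape_time` is just a made-up stand-in for the benchmark's per-pixel function):

    using CUDA

    # Hypothetical stand-in for the per-pixel escape-time computation.
    function escape_time(z0, maxiter)
        z, i = z0, 0
        while i < maxiter && abs2(z) <= 4f0
            z = z * z + ComplexF32(-0.5f0, 0.75f0)
            i += 1
        end
        return i
    end

    d_q = cu(rand(ComplexF32, 2^12, 2^12))
    escape_time.(d_q, 16)                                  # warm-up / compilation

    t_kernel = @elapsed CUDA.@sync escape_time.(d_q, 16)   # device execution only
    t_total  = @elapsed Array(escape_time.(d_q, 16))       # execution + copy to host

If t_total is dominated by the copy, that would explain the gap.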
While having a Torch-esque GPU ndarray is great, the ability to easily write your own kernels without having to compile gnarly C++ code is what sets Julia apart from competitors IMO. Not sure if there's any other dynamic language offering anything like this.
You can of course create Clojure/lisp code that would generate kernel code, but I am strongly against it as the default option, since I find plain C much better suited to small kernels close to hardware (which are only a small proportion of the overall code) than either lisp or Julia.
Please note that there is no need for an out-of-process compilation step. You compile this by calling a Clojure function, interactively, from the REPL.
Sure, but that's a completely separate thing from what Julia is doing here. The whole point of the GPU backend is that you get almost the full flexibility of the Julia language at your disposal, without having to fall back to C code. As a result, not only is your code more familiar to Julia programmers, but you also get to share large amounts of code between CPU/GPU/TPU. If all you're doing is super small kernels that may not matter much, but GPUs are so powerful these days that it's perfectly sensible to offload rather large programs that you may not want to write in C anymore.
It's not just easier to read for Julia programmers. Any pythonista can mostly read Julia without significant issue. They're both high level and syntactically similar enough. I know Python is a lot more OO, but when you're just writing a short script it looks a lot like Julia if Julia didn't need you to specify types.
Also, that exact C example can be written almost verbatim in Julia, so C isn't "better suited" even for these small kernels; you can drop to that low level with even clearer syntax in Julia.
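For anyone curious what that looks like, here's a rough CUDAnative-style saxpy kernel (my own sketch, not from the article; the exact launch syntax varies a bit between versions):

    using CUDAnative, CuArrays

    # The kernel body reads almost line-for-line like the CUDA C version.
    function saxpy!(y, a, x)
        i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        if i <= length(y)
            @inbounds y[i] += a * x[i]
        end
        return nothing
    end

    d_x = CuArray(rand(Float32, 1024))
    d_y = CuArray(zeros(Float32, 1024))
    @cuda threads=256 blocks=4 saxpy!(d_y, 2f0, d_x)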
I find Fortran pretty pleasant for scientific work (what it was designed for) and take pleasure in knowing it hasn't changed a whole lot over the years: F77, F90, etc.
It seems kinda weird to tout how great it is that we have CuArrays and CLArrays when CLArrays hasn't been updated for 1.0 and only claims experimental support for 0.6.
Really hoping we see some movement on CLArrays in the near future.
I want to see it updated too. It was really nice to write/test/debug GPU codes on my cheap laptop without an NVIDIA GPU, and then switch over to my desktop and CUDA with single line changes. Even though CuArrays tends to be quite a bit faster, this convenience cannot be overstated. I didn't realize how much I'd miss it until I was at a conference and couldn't just fiddle around with some GPU integration I had to debug.
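For context, the "single line changes" are possible because the rest of the code is written against the generic array interface; a sketch of the pattern (package names as discussed in this thread):

    # Pick a backend with one line; everything below it is generic code.
    # using CLArrays; const GPUArray = CLArray   # laptop without an NVIDIA GPU
    using CuArrays;   const GPUArray = CuArray   # desktop with CUDA

    A = GPUArray(rand(Float32, 1024, 1024))
    B = A .* 2f0 .+ 1f0      # the same broadcast code runs on either backend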
I think the reason CLArrays.jl only claims experimental support on v0.6 is because it uses Transpiler.jl, which is quite limited and a direct translation. CUDAnative.jl is much more advanced and uses LLVM's PTX backend. I think Simon mentioned he wanted to do a more direct compilation route, and that would probably be the change to start calling it experimental? I don't know, I'd like to hear about the future of CLArrays as well. If we have both working strong, boy oh boy I'll be happy.
They did that to themselves by originally sticking with C and forcing the whole read-source, compile, link process at runtime, instead of the multiple-language support that CUDA offers.
OpenCL has been improved since then, but now it is too late.
If anyone wants to try out a free GPU using Google Colab/Jupyter (K80, you might run into ram allocation issues if you're not one of the lucky users who get to use the full amount) here's a quick guide to get a Julia kernel up and running: https://discourse.julialang.org/t/julia-on-google-colab-free...
It doesn't really describe the fundamental difference between a GPU and a 4000-core CPU, which is that the GPU has a shared program counter. All the cores must execute the same instruction at each cycle.
The conceptually independent threads are executed on 32-wide SIMD units. "All the cores" within a warp/wavefront must execute the same instruction each cycle.
GPU threads got even more independent in NVIDIA's Volta architecture; search for Volta ITS (independent thread scheduling) for details.
It's true that the programming model allows that, but underneath, all threads within a warp will execute the same instructions. However, if there is a branch, some threads can be predicated so their instructions have no effect. This is called warp divergence and can become a performance issue. If possible, branch only on the thread index using multiples of the warp size. There's a cool slide deck on implementing a parallel sum algorithm that explains this really well.
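A small hypothetical pair of kernels (CUDAnative-style Julia) showing the difference; both assume the launch configuration exactly covers `length(out)`:

    using CUDAnative

    # Divergent: neighbouring threads in the same warp take different sides
    # of the branch, so the warp executes both sides with lanes masked off.
    function divergent!(out)
        i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        if i % 2 == 0
            @inbounds out[i] = 1f0
        else
            @inbounds out[i] = 2f0
        end
        return nothing
    end

    # Uniform per warp: the condition is constant across each 32-thread warp,
    # so no warp ever executes both sides of the branch.
    function uniform!(out)
        i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        warp = (i - 1) ÷ 32
        if warp % 2 == 0
            @inbounds out[i] = 1f0
        else
            @inbounds out[i] = 2f0
        end
        return nothing
    end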