An in-depth look at OCaml’s new “best-fit” garbage collector strategy (ocamlpro.com)
176 points by testcross on March 24, 2020 | 41 comments



"Remember that whatever works best for you, it’s still better than having to malloc and free by hand. Happy allocating!"

Nice, they are saying exactly the same thing as those pesky game developers.

https://www.youtube.com/watch?v=tK50z_gUpZI


Yeah, like Tim Sweeney.

"It's interesting that many games can afford a constant 10x interpretation overhead for scripts, but not a spikey 1% for garbage collection."

https://twitter.com/timsweeneyepic/status/880607734588211200

https://wiki.unrealengine.com/Garbage_Collection_Overview

Which engine was it again that Nintendo, Microsoft and Google chose as a first-party option for their 3D APIs?

https://developer.nintendo.com/tools

https://docs.microsoft.com/en-us/windows/mixed-reality/unity...

https://stadia.dev/blog/unity-production-ready-support-for-s...

https://developer.android.com/games/develop/build-in-unity

The anti-GC crowd in the games industry is no different from those who fought the adoption of C/Modula-2/Pascal over Assembly, and then fought the adoption of C++ and Objective-C over C.

Eventually they will suck it up when the major platform owners tell them it is time to move on.


>"It's interesting that many games can afford a constant 10x interpretation overhead for scripts, but not a spikey 1% for garbage collection."

Why is that surprising?

Games are basically about humans predicting things, and random spikes prevent that from happening in time-sensitive games. Beyond the gameplay implications, I suspect there's also something about jerkiness in movement that bugs human senses.


It's not entirely surprising, but one might imagine a different approach: always allocate a 1% buffer for an unexpected GC.

It's not a very satisfying answer (and there are likely much better tradeoffs to be made), but given the 10x vs. 1% comparison (not entirely apples to apples, though), the comment sounds a bit more interesting.


The problem is that with most GC algorithms 1% of the frames take 2x (or even 20x) as long, not that individual frame times vary by 1%.


How do you allocate a 1% buffer of time? The GC issue in games (or any other timing sensitive application) is about the "world freezing", not memory. Which is why incremental GC is a thing. At best, you're borrowing time.


The problem with GC isn't the "spikey 1%", it's the fact that GC implies the "everything is a pointer to an object" programming model, which in turn fragments your memory and destroys your cache.

Performance-oriented code implies everything is on the stack and/or packed into large arrays, at which point you don't need a GC after all.
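
To make the contrast concrete, here's a toy Rust sketch (my own illustration, not from the article): the same update loop over particles stored by value in one contiguous Vec versus boxed individually, which is the pointer-chasing/fragmentation cost being described.

    // Toy illustration only: Particle, step_packed and step_boxed are made up.
    #[derive(Clone)]
    struct Particle {
        pos: [f32; 3],
        vel: [f32; 3],
    }

    // Packed: one linear sweep over contiguous memory, no per-object indirection.
    fn step_packed(particles: &mut [Particle], dt: f32) {
        for p in particles.iter_mut() {
            for i in 0..3 {
                p.pos[i] += p.vel[i] * dt;
            }
        }
    }

    // Boxed: same logic, but every iteration follows a pointer to a separate
    // heap allocation, scattering the working set across memory.
    fn step_boxed(particles: &mut [Box<Particle>], dt: f32) {
        for p in particles.iter_mut() {
            for i in 0..3 {
                p.pos[i] += p.vel[i] * dt;
            }
        }
    }

    fn main() {
        let mut packed = vec![Particle { pos: [0.0; 3], vel: [1.0; 3] }; 1000];
        let mut boxed: Vec<Box<Particle>> = (0..1000)
            .map(|_| Box::new(Particle { pos: [0.0; 3], vel: [1.0; 3] }))
            .collect();
        step_packed(&mut packed, 0.016);
        step_boxed(&mut boxed, 0.016);
        println!("{} {}", packed[0].pos[0], boxed[0].pos[0]);
    }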


I guess this mentality leads to the current state of play where all UI-related latencies are through the roof compared to the 80s. At work, I deal with systems that require 16G of heap as the minimum. Funnily, when things get rewritten in Rust providing the exact same functionality and the same or better performance, the memory requirement goes down 10x (or more). It is up to us how much garbage our systems produce and how much CO2 is wasted on this. I guess many of us are OK with it, while some of us are not. https://blog.discordapp.com/why-discord-is-switching-from-go...


I've seen multiple production systems rewritten in Rust, and the consensus of the developers working on them is that, while the Rust version was more performant, the majority of that performance is attributable to the rewrite itself and not the language, and that a performance-focused rewrite in the original language would have also seen huge performance gains.

As the truism goes, if you require your software to be performant, you must first make Performance a Requirement.


Sure, I bet they feel that way. However, if you look at rewrites like the one at Discord[1], it is dead obvious that the rewrite helped not because of the rewrite by itself, but because there is (mostly) no GC, plus some other smaller things (better data structures for certain tasks). In my experience working a lot with Fortune 500 companies over the last 15 years, developers are usually not aware of the low-level details of their stack, including (but not limited to): networking, garbage collection, concurrency, parallelism. It is the exception (mostly in FAANG companies) when these low-level details are very well understood and dealt with. It is no surprise that these companies are picking up Rust, because those developers actually understand how much less mental overhead it is to work in Rust (once it compiles successfully :) ). I know it is a new language and there are hoops (async/await), but at least we can be sure that there is no need for GC tuning, memory corruption is non-existent, and we get a few other nice properties.

In my mind performance is always a requirement (and a feature). We just stopped caring a long time ago, because it is easier to think about performance as a hardware or capacity problem, memory as a GC problem, and so on.

1. https://blog.discordapp.com/why-discord-is-switching-from-go...


When you're doing gamedev and are bumping up against your frame budget, the GC is just another form of memory management. Doing GC-per-frame tends to work pretty well with a generational GC (the generational hypothesis and frame-by-frame updates go hand in hand), but you usually have to take care with long-lived data. That's when you end up getting into more manual memory management combined with a GC. In a way, your GC'd high-level language runtime system ends up being your scripting language.

That's how I've been thinking about it with Haskell at least (lots of GC knobs, manual performGC hook, compact regions for having long-lived data, good FFI, as high-level as any scripting language you could hope for)


What's their opinion on Rust?


I would imagine excited. Rust's affine type system is an application of logic theory. OCaml originated in French academia, and (from anecdotal experience) those academics tend to disdain how impure most software (and memory management) is. While Rust does not have the purest theoretical foundations, it's still a breath of fresh air and will likely result in people paying more attention to the work of theory researchers.


Before self-hosting, the Rust compiler was originally written in OCaml, so presumably there's an overlap in communities there.


I mean, game developers, not OCaml devs

The point was, they said "do manual memory management" if you want speed.


Great question. I am really hoping that the non-GC world takes off with Rust.


How does Rust deal with long-lived objects that cannot be block-scoped (or request-scoped, in the context of a server for example)? A typical example here would be a UI framework, where memory has to be managed for the window and its widgets, and then the entire application is suspended until the next event. The user can open and close windows in random orders, and so on. Perhaps some windows generate a lot of associated data which should be disposed of when that window closes, but can also be dragged over into the other windows and duplicated around, so the data is not “owned” by a single window.

It seems to me that this is where the GC “set and forget” model really shines, since otherwise you just have to do all that work manually using an allocation and a free, or a constructor and a destructor, or some similar pattern. Perhaps Rust has some clever answer for this?


> How does Rust deal with long-lived objects that cannot be block-scoped (or request-scoped, in the context of a server for example)?

The issue is not really the scope, the issue is how many owners the data needs to have. A variable that needs to live longer than a scope, but only has a single owner, can be directly returned from that scope. Once you have the need for multiple owners, the most straightforward answer is "reference count them," but it depends on your exact requirements.

> A typical example here would be a UI framework, where memory has to be managed for the window and its widgets, and then the entire application is suspended until the next event.

GTK heavily uses reference counting. People are also investigating what a "Rust-native" UI toolkit would look like, taking strong influence from ECSes. It's an open question whether those architectures end up better than refcounts.
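
A minimal Rust sketch of that "reference count them" answer for the UI case above (the Window/Document names are made up for illustration): two windows share one document, and it is freed only when the last handle goes away.

    use std::cell::RefCell;
    use std::rc::Rc;

    struct Document {
        text: String,
    }

    struct Window {
        title: String,
        // Shared, mutable ownership within one thread; Arc<Mutex<_>> across threads.
        doc: Rc<RefCell<Document>>,
    }

    fn main() {
        let doc = Rc::new(RefCell::new(Document { text: "hello".into() }));

        let a = Window { title: "Editor".into(), doc: Rc::clone(&doc) };
        let b = Window { title: "Preview".into(), doc };

        a.doc.borrow_mut().text.push_str(", world");
        println!("{}: {}", b.title, b.doc.borrow().text);

        drop(a); // closing one window does not free the shared document...
        drop(b); // ...it is dropped here, when the last reference goes away.
    }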


It seems like you want a reference-counted pointer: you get shared ownership, and the object is deleted when the last reference to it goes away.

Similar to C++'s std::shared_ptr


Right, but then you take on a bunch of baggage you don't have in a GC setting. For instance, if you want to make zillions of these objects and delete them, you are paying overhead for lots of allocations and deallocations, made heavier by the reference counting. You pay time overhead for atomic increment/decrement, and if there are cyclic structures then you pay a big price in code complexity to deal with them properly and not cause memory leaks.
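
For the cyclic-structure point, the usual Rust answer is to make back edges weak. A small sketch (the widget names are made up), assuming single-threaded Rc/Weak rather than Arc:

    use std::cell::RefCell;
    use std::rc::{Rc, Weak};

    struct Widget {
        name: String,
        parent: RefCell<Weak<Widget>>,      // weak: does not contribute to the count
        children: RefCell<Vec<Rc<Widget>>>, // strong: a parent owns its children
    }

    fn main() {
        let parent = Rc::new(Widget {
            name: "window".into(),
            parent: RefCell::new(Weak::new()),
            children: RefCell::new(Vec::new()),
        });
        let child = Rc::new(Widget {
            name: "button".into(),
            parent: RefCell::new(Rc::downgrade(&parent)),
            children: RefCell::new(Vec::new()),
        });
        parent.children.borrow_mut().push(Rc::clone(&child));

        // Navigating upward requires upgrading the weak handle, which fails
        // gracefully once the parent is gone instead of leaking a cycle.
        if let Some(p) = child.parent.borrow().upgrade() {
            println!("{} is inside {}", child.name, p.name);
        }
    }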


I'm not aware of any other industrial-strength GC using this strategy. Are there any?

If not, is there something about OCaml that makes this strategy more suitable than it is for other languages?

If not, is this a case of this being the best strategy they have the resources to implement, rather than the best possible strategy?


I feel that it is mostly about nobody else explicitly calling the strategy best-fit. For example, BIBOP-derived GCs (which for the purposes of this discussion include the BDW GC) are inherently best(-ish)-fit, and in fact traditional Unix malloc is also mostly best-fit.


I think HotSpot's CMS old-gen allocator used a best-fit strategy, since its collector didn't compact. But CMS has been deprecated because newer, compacting low-pause collectors have taken over its use cases while being less fragile.


If memory serves, the new one uses an extra object header that points from the old object to the new one during move operations, and any reads of the old object get forwarded to the new one.

I'm pretty sure that would not have performed well without the aggressive prediction logic in modern processors.

Java 1's object accesses always read through an indirect pointer, but that went away in the name of performance, either when HotSpot was introduced or in the next round of GC improvements.


There are two new GCs, Shenandoah and ZGC.

Indirect pointers, or Brooks pointers as they are called, were used in Shenandoah v1 to allow an application thread that performs a read to not move the object during the evacuation phase. This strategy was removed in Shenandoah v2 for better throughput, so now both reads and writes by the application move the object during the evacuation phase.

ZGC has never used Brooks pointers.
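
For anyone trying to picture the mechanism, here is a purely conceptual Rust sketch of a Brooks-style forwarding word and read barrier; it is not how HotSpot actually implements Shenandoah, just the shape of the idea.

    use std::cell::Cell;

    // Conceptual only: every object carries an extra word that points either to
    // itself or to its evacuated copy; the collector updates it when it moves
    // the object, and every access goes through it.
    struct BrooksObject<T> {
        forward: Cell<*const BrooksObject<T>>,
        payload: T,
    }

    impl<T> BrooksObject<T> {
        fn new(payload: T) -> Box<Self> {
            let obj = Box::new(BrooksObject {
                forward: Cell::new(std::ptr::null()),
                payload,
            });
            // Initially the forwarding word points at the object itself.
            obj.forward.set(&*obj as *const _);
            obj
        }

        // Read barrier: always dereference through the forwarding word, so a
        // reader transparently sees the new copy once the object has moved.
        fn read(&self) -> &T {
            unsafe { &(*self.forward.get()).payload }
        }
    }

    fn main() {
        let x = BrooksObject::new(42);
        println!("{}", x.read());
    }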


I had to look up 'Brooks pointers'. For anyone else in this position, these two blog posts seem a good place to start: https://rkennke.wordpress.com/2013/10/23/shenandoah-gc-brook... , https://blog.plan99.net/modern-garbage-collection-part-2-1c8...


It's not exactly an industrial-strength GC, but Nim uses TLSF to reduce fragmentation: http://www.gii.upv.es/tlsf/. I'm not sure how that compares to the strategy in the article, though.


Hell is other people's algorithmic choices. My GC-fu isn't high-level enough to comment on this one, but I just spent the last two days suffering in dependency hell because someone thought it would be a good idea to use a full-blown SAT solver for package management. Grr.


Sometimes I have lamented the fact that 80% of everything we're ever going to properly discover in Computer Science was already discovered forty years ago.

What hasn't been explored very well is how to formulate these solutions so mere mortals can comprehend how they work. Algorithm accessibility is, I believe, the limiting factor on building systems any bigger than the ones we have now. When there is one tricky bit in the code, you can get away with asking people to dive in and learn it. When there are 50? 100? Just figuring out the consequences of how those systems interact is a full-time job, let alone how they function internally.

Give me a SAT solver, or a parser, or half a dozen other things, that decomposes a problem the way a human would do it, just faster and with far more accuracy, and I could learn it in a week. Take all my Paxoses and replace them with Raft, then swing by and make a second pass on a few.


The real problems are conventions and ease of use. You're not going to solve these problems by pointing at a paper from the 80s and shouting "We did this 40 years ago! Why is everyone so dumb nowadays?".

Let's take WebAssembly as an example. Its only reason for existence is convention. You need everyone to agree on an IR that is not only cross-platform and stable but also low-level and allows execution of untrusted code within a sandbox.

If you look at the competitors, it becomes obvious that they do not follow these core principles. LLVM IR just isn't meant to be used in a browser. JVM bytecode just wasn't meant to be used in a browser. So what are we going to do? Use them anyway? That's how you get situations like the horribly insecure JVM plugin. You can restrict the JVM plugin to a subset that is secure enough for browsers and add additional opcodes for low-level behaviors, but then you are no longer using JVM bytecode. It's a completely new invention at that point, but Oracle will still hunt you down.


This is a bit harsh. JVM bytecode was meant to be in the browsers of that time; plugins were a respected part of the ecosystem back then. There were two types of security problems.

The majority were not bytecode-related, but were bugs in the interface to the outside world. There is no reason why WebAssembly is better in this regard, e.g. WebGL vs. Java's graphics APIs. This mainly comes down to corporate culture prioritizing security. If financial hardship or other stresses befall the browser maker, WebAssembly will probably give the same trouble as Java did.

The minority were related to the soundness of the JVM itself. Most of these have been fixed. These bugs are in general nasty, as the basics had been validated by a mathematical proof. A few are still there and very hard to fix, like locking system objects such as Thread.class. I think this was a learning experience for all secure VMs that followed, and WebAssembly knew what to avoid because of Java. Only time will tell how well WebAssembly withstands the nasty ideas humanity throws at it.


It's hard to optimize algorithms while also making them general and composable. We are forever reimplementing good ideas in different combinations of systems.


> ...someone thought it would be a good idea to use a full-blown SAT solver for package management

Relevant: https://research.swtch.com/version-sat


Right. SAT solvers are an excellent theoretical fit and a terrible practical fit, at least in the current state of tooling. Their runtime _does_ explode, and the tooling _is not_ any good at hinting as to why, even when the explanation turns out to be very simple. "lol install takes an hour now" makes for a very poor error message, and debugging a black box that takes an hour to evaluate each input is just... ugggggghh. I can only thank my lucky stars that it did finish after an hour rather than taking an indefinite amount of time.

In contrast, the "danger" of heuristics is that they fail to dig up an exceedingly clever combination of archaic package versions that technically fit the user's specified requirements. It's such a small problem that it might even be considered a feature, since said exceedingly clever combinations are likely to be the result of poor version definitions and unlikely to be what the user actually wants.

Of course, if the only people who can be persuaded to write package managers are people doing research in the subject, then I suppose letting them inflict their pet projects on us is one way to compensate them for an otherwise thankless task, and perhaps in that sense it's fair.


Do you have an example of a real-world package dependency situation that generates a truly difficult SAT instance?


That’s because package version selection is a SAT problem.

https://research.swtch.com/version-sat


Surprised not to see any talk about slab allocation.

Typical malloc implementations today use slabs:

A variety of allocation-classes is defined; for example, 1B, 2B, 4B, 8B, ...

Each allocation-class is essentially its own independent heap.

Slabs are really good:

Allocation is fast: a few cycles to determine the slab, then pick the first available cell, done.

Compaction is easy: all cells have the same size!

And I repeat, all cells within an allocation-class have the same size! This means things like pin cards, etc...

Compared to the pointer-chasing inherent to a splay-tree... I do wonder.
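
A rough Rust sketch of the size-class idea described above (the sizes, names, and structure are made up; real allocators add thread caches, page handling, and so on):

    // One free list of cell slots per power-of-two size class; all cells in a
    // class have the same size, which is what keeps the bookkeeping simple.
    const NUM_CLASSES: usize = 16; // 8 B up to 256 KiB, for example

    struct SizeClassHeap {
        free_lists: [Vec<usize>; NUM_CLASSES], // free cell indices per class
    }

    impl SizeClassHeap {
        fn new() -> Self {
            SizeClassHeap { free_lists: std::array::from_fn(|_| Vec::new()) }
        }

        // Map a requested size to its class by rounding up to a power of two.
        fn class_of(size: usize) -> usize {
            let rounded = size.next_power_of_two().max(8);
            rounded.trailing_zeros() as usize - 3 // 8 B -> class 0
        }

        // Allocation really is "pick the slab, pop the first available cell".
        fn alloc(&mut self, size: usize) -> Option<usize> {
            self.free_lists[Self::class_of(size)].pop()
        }

        fn free(&mut self, cell: usize, size: usize) {
            self.free_lists[Self::class_of(size)].push(cell);
        }
    }

    fn main() {
        let mut heap = SizeClassHeap::new();
        heap.free(0, 24);                 // return a cell to the 32-byte class
        println!("{:?}", heap.alloc(24)); // Some(0): reused from the same class
    }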


What about the following strategy: find the first space that is large enough. If it is smaller than double the required size, take it as is (a little more space is allocated than would be strictly needed). If it is larger than double the size, split it; this leaves a piece that is at least as big as the requested size. Assuming that the allocations follow some distribution, it is likely that another piece of memory of this size will be allocated in the future. In this way, the distribution of available spaces will remain about the same as the distribution of wanted spaces. (Of course, one should also first round the size up to some power of two and possibly enforce a minimum size.)
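
A quick Rust sketch of that strategy as I read it (the free-list representation is just for illustration):

    // Free blocks are (offset, size) pairs; this only shows the rule described
    // above: take the whole block if it is less than double the request,
    // otherwise split it.
    struct FreeBlock { offset: usize, size: usize }

    fn allocate(free: &mut Vec<FreeBlock>, request: usize) -> Option<usize> {
        let size = request.next_power_of_two(); // round up, as suggested
        let idx = free.iter().position(|b| b.size >= size)?; // first fit
        if free[idx].size < 2 * size {
            // Close enough: hand out the whole block, wasting a little slack.
            Some(free.remove(idx).offset)
        } else {
            // Split: the remainder is at least as large as the request, so a
            // similarly sized allocation is likely to reuse it later.
            let offset = free[idx].offset;
            free[idx].offset += size;
            free[idx].size -= size;
            Some(offset)
        }
    }

    fn main() {
        let mut free = vec![FreeBlock { offset: 0, size: 1024 }];
        println!("{:?}", allocate(&mut free, 100)); // Some(0), splits off 128
        println!("{:?}", allocate(&mut free, 200)); // Some(128), splits off 256
    }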


Sounds similar to a buddy allocator: https://en.wikipedia.org/wiki/Buddy_memory_allocation


Shout out to Damien Doligez, a phenomenal engineer


This reminds me somewhat of the main PostgreSQL allocator. It keeps segregated freelists for smaller allocations, and then larger allocations are handled by malloc.





