Making Ruby’s garbage collector copy-on-write friendly, part 7
It turns out that part 6 was not final. I’ve been working on the garbage collector again. These are the changes:
- The garbage collector is now 20% faster! For every object to be marked or sweeped, the garbage collector has to find the heap that an object belongs to. Empirical evidence shows that the chance is very high that the current object belongs to the same heap as the last object’s heap. So caching the last lookup result improved performance dramatically.
- It is now possible to enable debugging of the GC at runtime, using environment variables.
Set RUBY_GC_DEBUG=1 to enable GC debugging. It will print a message to stderr every time the garbage collector is started. - If GC debugging is enabled or if the environment variable RUBY_GC_ALLOC_HEAP_WITH_FILE is set, it will use mmap() on /dev/zero to allocate the Ruby heaps. This makes it possible to see exactly which heap has been made dirty, and by how much. I’m going to use this for further investigations.
- A new method, GC.cow_friendly?, has been added. This allows apps to dynamically determine whether the garbage collector is copy-on-write friendly.
You can download the Ruby 1.8.6 patch here.
In addition, I’ve put my SVN repository online. It’s here:
http://public.railsplugins.net/repos/ruby/
The ‘vendor/’ folder contains the original Ruby source.
The ‘tags/’ folder marks important milestones in my development progress. In chronological order, they are:
- mark-table-1: the first milestone. See my past blog posts for details.
- mark-table-2: the second milestone. See my past blog posts for details. If I remember correctly, this is where I started implementing memory pools.
- mark-table-3: the third milestone. If I remember correctly, this is where I optimized the memory pools and performance.
- bitfield-1: the first try using bit fields instead of sets.
- bitfield-2: the second try using bit fields instead of sets. If I remember correctly, this contains more optimizations than the previous milestone.
The ‘trunk/’ folder is the latest version.

Andrew Cholakian said,
January 14, 2008 @ 4:24 pm
Thanks for all the awesome work. Do you have any plans to port this over to YARV or Rubinius at some point in the future, or are those already copy-on-write friendly?
Hongli said,
January 14, 2008 @ 4:38 pm
Thanks Andrew.
I’m pretty sure YARV is not copy-on-write friendly. From what I’ve seen, their garbage collector hasn’t changed much. As for Rubinius, I don’t think it’s as easy as “porting”. Rubinius has a completely different codebase. I could talk to them about this issue though.
That said, I’m having a difficult time figuring out how to contact them. Rubinius doesn’t seem to have a mailing list, and I can’t find good contact details on their Lighthouse pages.
Chuck Vose said,
January 16, 2008 @ 11:35 pm
Find Eric Hodel or Ryan Davis on the internet. They’re both working on it now I believe. Also, I forget who’s running this blog but it’s primarily about Rubinius so they should know how to get ahold of them. Or try contacting Engine Yard about it, iirc they’re paying the bill for rubinius.
Hope that helps, I’m excited about all your work and would love to see it in 1.9 and Rubinius.
Making Ruby’s garbage collector copy-on-write friendly, part 7 said,
January 19, 2008 @ 3:51 pm
[…] Making Ruby’s garbage collector copy-on-write friendly, part 7 The garbage collector is now 20% faster! For every object to be marked or sweeped, the garbage collector has to find the heap that an object belongs to. Empirical evidence shows that the chance is very high that the current object … […]
Daniel Berger said,
January 20, 2008 @ 11:43 pm
Got a diff for 1.8.6 p111 by any chance?
Daniel Berger said,
January 21, 2008 @ 5:26 am
Never mind, got it applied to 1.8.6. I’m seeing a 20% speed hit with this patch.
Yukihiro "Matz" Matsumoto said,
February 4, 2008 @ 9:20 am
Is it OK for you if I’d merge this patch into 1.9?
Hongli said,
February 4, 2008 @ 12:17 pm
Sure. In fact, I would be honored if you do.
Could you notify me when you’ve done so? You can find my contact details at the “About me” section.
Randy Parker said,
February 5, 2008 @ 8:22 pm
First, thank you very much for your excellent work on COW-friendly Ruby.
Of course, you don’t need thanks from me, after Matz’s request yesterday! I hope he can incorporate your improvements. Congratulations!
I have a different question: where does all the memory go in a Ruby server? I’ve started looking at ‘pmap’ output of a Mongrel pid on Ubuntu, and immediately notice how much memory is mapped in three surprisingly large anonymous segments:
ruby1.8.so - 22.9 M
nkf.so - 6.3 M (Kanji support)
syck.so - 7.3 M (YAML parsing)
I may try pmap on Solaris to get better understanding of exactly why so much mongrel (or thin) memory is private. Generally, Solaris has always had the most detailed and useful VM tools.
Anyhow, just looking at libruby1.8.so:
$ size /usr/lib/libruby1.8.so /usr/lib64/libruby1.8.so
text data bss dec hex filename
859760 15236 121328 996324 f33e4 /usr/lib/libruby1.8.so
859760 15236 121328 996324 f33e4 /usr/lib64/libruby1.8.so
Nothing here is surprising - a meg of code, 15K of initialized data, and ten times as much un-intialized.
So why are 23M mapped anonymous at runtime?
And why do ‘nkf’ and ’syck’ map their own anon segs? (actually, not having any Kanji content, I’d just as soon leave ‘nkf’ out !)
I’ve not looked into this yet, and am running it by you because it has been a decade since I paid much attention to memory. But, like you, I’m just surprised at how much memory these servers consume. Why? It just seems like far more memory than I’d expect. Perhaps you’ve already noticed this, and it makes sense to you?
Randy Parker said,
February 5, 2008 @ 10:34 pm
It looks like the 3 big segs contain the optrees for all the framework code for Ruby, Syck, and NKF. And I guess that for a dynamic language, that can be modified at runtime, you have to treat them as private. Damn. From my C days, I had assumed all the code would be in the ‘text’ seg. For a dynamic language, it must be put in the heap.
I can see routine module loading modifying the optree as the server starts up, but do most actual Rails apps really change it after launch? To make these segs forking cow-friendly, their mapping would have to become read-only before forking. Then any attempt to modify the optree would throw an exception.
Roger Pack said,
February 15, 2008 @ 5:22 am
I think there is a good potential for speed up of the GC if you have the ‘free’ list and also create a ‘used’ list and work with them appropriately. It would take more RAM but then you wouldn’t have to drag every page in from swap in order to see if every single segment within it is free. Or maybe this work already does something like this?
This work looks pretty good, too. Thanks for the good work! Next up: figuring out an even better GC overall algorithm.
-Roger
roger said,
March 20, 2008 @ 6:21 am
Shoot can’t seem to find the link for the rubyforge project of the “optimized GC build” for Ruby. Anyway my suggestions for such would be to have a small, fixed heap size (use binary sort as discussed on core), to cut down on memory use, as well as allowing for COW friendliness. This would be a nice addition to Ruby.
Some other things that would be nice:
ability to arbitrarily and “dangerously” collect an existing object.
optional ability to set the memory limit after which GC occurs (so you can set it higher–but at will).
Ability (on macintosh) to compile with dtrace support built in (https://rubyforge.org/projects/rubyport-dtrace/ has the patches).
Support for the profiling patch by Joyeux (counts memory allocations per function call).
That would rock! Thanks for doing this stuff.
Take care.
-R
Hongli said,
March 20, 2008 @ 11:08 am
Thanks, I’ll take a look at them.
roger said,
March 20, 2008 @ 8:33 pm
In retrospect I think the following would allow for “faster” garbage collection:
Make all allocated “heap chunks” have a fixed maximum length (say 20K or 200K or whatever you’d like).
Now increase the size of each ruby object just large enough to contain an “index number” for where it lies within each “heap chunk” (like “I’m the 7th ruby object within this chunk!” — this is useful to be able to lookup in O(1) the beginning of the heap chunk, which is where we could store some useful information, and look it up fast.
So with this type of setup it is easy to see how you’d have O(1) lookup for say…the bit field where you are keeping track of “live versus no longer alive” objects. And it wouldn’t take too much memory–maybe 2B extra per ruby object, increasing the total size 10%, but keeping the GC fast.
I’m not sure if this would be as fast as the current patches or not. Perhaps similar, and allowing for cow, but using more RAM.
Now here’s where you could get a real speedup (bear with me–I’m actually not sure if this would work or not, but am thinking out loud).
Our objective here is to have a “free list” of objects (which exists) and to also add an “non free list” (”allocated” list). If we had such a thing as a “allocated” linked list then I think (just thinking out loud) that after you have traversed the stack and marked objects as “live” you would only need to traverse the “allocated” list and remove from it any objects that are no longer live. So say you are making “80% use” of your allocated memory–you would thence avoid traversing 20% of it during GC. Which might save time. I am “guessing” that something like that would save on overall GC time.
If it would, then here is a thought on implementation of such a beast efficiently within RAM.
Besides the index notd above, also add to each ruby object another index–this one to mean “next in live list within this heap chunk” — so another 2 bytes.
Use these new second indices to basically create a linked list, within the heap chunk, of the live objects.
When an object is collected, you could also have it link to its neighbors in such a way–create a “free list” just within the chunk. Not sure if that would be useful, except to guarantee that you have some sort of locality within the free list.
Now the question is on how to determine if you should collect a “heap chunk” at the end of a GC cycle.
You could add, at the beginning of each heap chunk, a counter of how many of its members are free, how many live. Or a pointer to where this number would be located or something similar. Some way to keep count “between GC cycles” or over time.
Just update this number as you free objects within the chunk.
After a GC, iterate through heap chunks and check if any of them have 0. If so, then free them to the OS.
Should this work, it could result in a speedup.
Thoughts?
Take care.
-R
Hongli said,
March 20, 2008 @ 9:13 pm
Hi Roger.
I’ve considered all those ideas in the past. But they all trade time with space. Each object would not be 2 bytes larger, but 4 bytes. A 2 bytes (16-bit) integer will impose a hard limit of 65k slots per heap, and I’m not sure whether that’s desirable. And then there’s also the problem with alignment. Integers on some objects wouldn’t be aligned to integer sizes.
So if we use 4-byte index fields, there will be a 20% memory overhead. I don’t think that’s acceptable, Rails app servers are already large enough. A “used list” will make everything another 20% bigger. I’m not sure whether all that extra space is worth the tiny performance gains.
roger said,
March 21, 2008 @ 7:13 pm
What you have said is true–you’re current implementation sacrifices speed for size, this one would sacrifice size for speed.
I’m not sure who would find which useful.
Some other points to consider, just thinking out loud
1) If you search the ruby-core list for comments on the GC, you’ll notice that some people have found that having smaller “heap chunks” allows them to be freed more easily back to the OS–which is a good thing. You mentioned the 65K ruby objects per chunk, or 1.3M heap chunks (which is probably reasonably large). However, it would increase the number of them, making constant time reference to their bit fields more desirable. In my own experience I have found that doing so reduces the size of rails’ apps and makes them faster (for whatever reason).
2) I wonder if, when searching the stack, the GC takes into account ‘cacheable’ lookup of which “heap chunk” it belongs to. Hmm.
3) It is conceivable that, though you’ll have a 20% memory overhead, you’ll not traverse it all, which might help with locality, so perhaps some memory blocks could be swapped to disk on a more or less permanent basis, saving actual used memory (which is what we care about).
Tough to tell without having the two side-by-side what the real effects are.
I believe the main thought I would make is that if you could convince the Ruby community to swallow the 20% memory overhead (because of speed, if it turns out to be significant lots of people would use it) then you could get, almost for free, cow-friendliness with almost no slow-down, and the ability to have arbitrary sized “heap chunks” which would enable you to use smaller ones, which would save on overall memory used. So basically at that point your cow-friendliness would be used by the entire community, instead of just those who opt to use it. Or, a better ruby overall.
In reality, though, really we NEED a generational GC. That’s just the honest truth. No matter how we tweak it, we’re still going to be traversing all of rails’ 30MB+ of allocated memory per GC, which will almost always be a speed hit.
Maybe Boehm?
Anyway good luck! Thanks for the optimization work!
-R
roger said,
March 21, 2008 @ 11:29 pm
sorry to spam–also nice for the ‘optimized ruby’ build would be if it displayed the full stack trace when an exception is raised and uncaught. Everybody wishes it had that
Cheers.
-R
Timo said,
April 2, 2008 @ 3:02 am
Thanks for creating this patch. I’ve been doing some benchmarks comparing this patch to a normal install, our old GC patch, 1.9 (from yesterday), and with ptmalloc3.
It seems this patch drops the speed by ~10% with a memory savings of ~30% for additional processes. My setup gives 100mb base, and 60mb extra per processes. With this patch, it drops to 38mb extra per process.
My benchmarks show that switching to ptmalloc3 is just a bad idea all around, and that Ruby 1.9 seems to be worse than 1.8 in all cases.
Timo Ewalds said,
April 3, 2008 @ 11:41 pm
By the way, switching to jemalloc (taken from the firefox3 source) instead of glib malloc or ptmalloc3 gives a huge memory advantage. For comparison to the above comment, the base memory is the same around 100mb, but it drops from 60mb per process for vanilla ruby and 38mb with just the above gc patch, all the way down to 23mb per process with the gc patch and jemalloc together. The cpu hit is 20% compared to 10% with just the gc patch, but for me, that’s worth it. It may be possible to rescue some of that cpu drain by making jemalloc unthreaded, but I haven’t put in that effort yet.
Roger Pack said,
April 30, 2008 @ 3:14 am
there is also nedmalloc, it appears