Making Ruby’s garbage collector copy-on-write friendly
Remember that I said that I give up? Well, I lied.
Something is driving me. I guess I loathe the idea of having to pay an extra $20 per month for the extra RAM to run more Rails apps.
Sleep has to be the best thing ever invented. After a night of sleep and some careful thinking, I managed to modify Ruby’s garbage collector to use a marking table instead of marking the objects directly. The patch seems to work: ‘make test’ and ‘make install’ both run fine (both run a non-trivial Ruby script). Rubygems, which is notorious for using lots of memory and CPU, runs fine as well. Ruby on Rails also seems to work fine.
How it works
Every Ruby object is represented by a C struct, and always begins with the following field:
unsigned long flags;
Ruby uses mark and sweep garbage collection. First it goes through all objects and marks every object which is in use. This is done by setting the FL_MARK flag on the aforementioned field. Then it goes through all objects again, and frees every object which is not marked. The problem was that this breaks copy-on-write semantics: it effectively marks nearly all pages as dirty. Using a mark table, which keeps the marking information in a separate memory region, is supposed to solve that problem, as Dough Beaver mentioned in this slide. My patch modifies the garbage collector to do just that.
We’re not done yet
Unfortunately things aren’t that simple (are they ever simple?
). I notice that the memory gains aren’t as good as I hoped. So I wrote the following test script:
-
list = []
-
500000.times do
-
list < < "foo"
-
end
-
-
pid = fork do
-
puts "forked"
-
STDIN.readline
-
-
ObjectSpace.garbage_collect
-
puts "garbage collected"
-
STDIN.readline
-
exit!
-
end
-
Process.waitpid(pid)
I use the GNOME System Monitor to monitor the processes’ memory usage. The ‘Writable memory’ column is especially interesting, as that column tells me how much memory is really private to that process (instead of being shared with the parent/child process). Immediate after forking, the child process’s VM size is 29.1 MB, and the writable memory is 76 KB. After running the garbage collection though, the writable memory jumps to 16.9 MB.
Something is definitely wrong here! I know that nothing will be freed - all strings that I put in the list array should never be freed as they are still reachable.
After intensive debugging, I found out that the marking table is using quite some memory. The marking table uses the hash table implementation as provided by Ruby’s st.c: st_table. st_table uses malloc() to allocate entries for the hash table. I noticed that each entry uses 24 bytes (this size is including malloc overhead). I have more than 500,000 strings in the list, so the marking table should contain at least 500,000 entries after the marking phase. 500,000 * 24 = 12,000,000. Uh oh, the marking table uses 12 MB of memory during the marking phase.
So the main issue now is to find a way to reduce the marking table’s memory usage. I could port Google’s sparse_hash to C. sparse_hash is a bit slower than a regular hash table, but has only 2 bits of overhead per entry. This would theoretically reduce memory usage from 12 MB to (32 * 500,000 + 2 * 500,000) / 8 bytes = 2 MB.
Does anyone have any better ideas?

Pratik said,
July 25, 2007 @ 2:05 pm
I think sparse_hash sounds like a good trade-off of speed/memory usage. Also, it’d be awesome if you could post pasties of “pmap -d ” output. I don’t really know about “GNOME System Monitor” and only thing I trust with calculating memory usage is pmap
You’re super cool in any case (even if you fail) ! Amazing job.
Thanks.
Hongli said,
July 25, 2007 @ 2:22 pm
Thanks.
Sure, here’s the pmap output just after forking:
And this is the output after garbage collection:
Andrew said,
July 25, 2007 @ 3:04 pm
I don’t have any better ideas, but I wish you the best of luck Hongli. Good work.
Jamie Flournoy said,
July 25, 2007 @ 4:57 pm
This is really exciting stuff! Keep up the good work.
Giles Morant said,
July 27, 2007 @ 9:50 am
Can you give a run-down of why 24 bytes are required for each entry? Could compression (of any sort) be of use?
If malloc is the primary driver of this, how about chunking the table, much like Ruby does: e.g. the first time an entry is required, 100kb is put aside for it; the next chunk could be double etc..
Otherwise, this is a problem which must have been overcome before- what does Perl/Python/Java (JRuby?) do??
This is really interesting stuff- Ruby/Mongrel is quite intensive on memory, and this could be a winner.
Hongli said,
July 27, 2007 @ 3:45 pm
Because a hash table entry is defined as follows:
struct st_table_entry { unsigned int hash; st_data_t key; st_data_t record; st_table_entry *next; };Those are 4 fields, 4 bytes each, so the size of the struct is 16 bytes. The remaining 8 bytes are malloc() memory management overhead. Since the marking table only needs a set (and not a full hash table), one can save memory by removing the ‘record’ (value) field. The ‘hash’ field also seems redundant. So by modifying st.c I can save 8 bytes per entry, though porting Google sparse_hash is probably even more efficient. And yes, malloc overhead can be solved by using a memory pool.
Perl and Python use reference counting, so they are not affected by this problem. (Though instead they have other problems, like circular references.) The Sun JVM implementation uses generational garbage collection (a modified form of mark-and-sweep) as far as I know. I’m guessing JRuby uses JVM’s garbage collection.
Java is not at all affected by this problem. I’m not sure how copy-on-write friendly the Sun JVM is, but Java apps cannot fork and don’t need fork, because Java apps cannot corrupt memory and because Java has excellent threading support. When a Java thread crashes, the others are unaffected.
Hongli said,
July 27, 2007 @ 8:50 pm
I’m baffled. I implemented a lighweight set structure based on st.c, but the memory saving is only 500 KB instead of the expected 3.8 MB.