Making Ruby’s garbage collector copy-on-write friendly

Remember that I said that I give up? Well, I lied. ;) Something is driving me. I guess I loathe the idea of having to pay an extra $20 per month for the extra RAM to run more Rails apps.

Sleep has to be the best thing ever invented. After a night of sleep and some careful thinking, I managed to modify Ruby’s garbage collector to use a marking table instead of marking the objects directly. The patch seems to work: ‘make test’ and ‘make install’ both run fine (both run a non-trivial Ruby script). Rubygems, which is notorious for using lots of memory and CPU, runs fine as well. Ruby on Rails also seems to work fine.

The Ruby 1.8.6 patch is here.

How it works

Every Ruby object is represented by a C struct, and always begins with the following field:

unsigned long flags;

Ruby uses mark and sweep garbage collection. First it goes through all objects and marks every object which is in use. This is done by setting the FL_MARK flag on the aforementioned field. Then it goes through all objects again, and frees every object which is not marked. The problem was that this breaks copy-on-write semantics: it effectively marks nearly all pages as dirty. Using a mark table, which keeps the marking information in a separate memory region, is supposed to solve that problem, as Dough Beaver mentioned in this slide. My patch modifies the garbage collector to do just that.

We’re not done yet

Unfortunately things aren’t that simple (are they ever simple? :( ). I notice that the memory gains aren’t as good as I hoped. So I wrote the following test script:

Ruby
  1. list = []
  2. 500000.times do
  3.         list < < "foo"
  4. end
  5.  
  6. pid = fork do
  7.         puts "forked"
  8.         STDIN.readline
  9.  
  10.         ObjectSpace.garbage_collect
  11.         puts "garbage collected"
  12.         STDIN.readline
  13.         exit!
  14. end
  15. Process.waitpid(pid)

I use the GNOME System Monitor to monitor the processes' memory usage. The 'Writable memory' column is especially interesting, as that column tells me how much memory is really private to that process (instead of being shared with the parent/child process). Immediate after forking, the child process's VM size is 29.1 MB, and the writable memory is 76 KB. After running the garbage collection though, the writable memory jumps to 16.9 MB.

Something is definitely wrong here! I know that nothing will be freed - all strings that I put in the list array should never be freed as they are still reachable.

After intensive debugging, I found out that the marking table is using quite some memory. The marking table uses the hash table implementation as provided by Ruby's st.c: st_table. st_table uses malloc() to allocate entries for the hash table. I noticed that each entry uses 24 bytes (this size is including malloc overhead). I have more than 500,000 strings in the list, so the marking table should contain at least 500,000 entries after the marking phase. 500,000 * 24 = 12,000,000. Uh oh, the marking table uses 12 MB of memory during the marking phase. :(

So the main issue now is to find a way to reduce the marking table's memory usage. I could port Google's sparse_hash to C. sparse_hash is a bit slower than a regular hash table, but has only 2 bits of overhead per entry. This would theoretically reduce memory usage from 12 MB to (32 * 500,000 + 2 * 500,000) / 8 bytes = 2 MB.

Does anyone have any better ideas?

7 Comments »

  1. Pratik said,

    July 25, 2007 @ 2:05 pm

    I think sparse_hash sounds like a good trade-off of speed/memory usage. Also, it’d be awesome if you could post pasties of “pmap -d ” output. I don’t really know about “GNOME System Monitor” and only thing I trust with calculating memory usage is pmap :)

    You’re super cool in any case (even if you fail) ! Amazing job.

    Thanks.

  2. Hongli said,

    July 25, 2007 @ 2:22 pm

    Thanks. :)

    Sure, here’s the pmap output just after forking:

    13639:   ./trunk/ruby test.rb
    Address   Kbytes Mode  Offset           Device    Mapping
    08048000     712 r-x-- 0000000000000000 008:00006 ruby
    080fa000       4 rw--- 00000000000b2000 008:00006 ruby
    080fb000   16828 rw--- 00000000080fb000 000:00000   [ anon ]
    b714b000    7160 rw--- 00000000b714b000 000:00000   [ anon ]
    b7849000    2104 rw--- 00000000b7bdd000 000:00000   [ anon ]
    b7c45000    1188 rw--- 00000000b7c45000 000:00000   [ anon ]
    b7d6e000    1260 r-x-- 0000000000000000 008:00011 libc-2.5.so
    b7ea9000       4 r---- 000000000013b000 008:00011 libc-2.5.so
    b7eaa000       8 rw--- 000000000013c000 008:00011 libc-2.5.so
    b7eac000      12 rw--- 00000000b7eac000 000:00000   [ anon ]
    b7eaf000     148 r-x-- 0000000000000000 008:00011 libm-2.5.so
    b7ed4000       8 rw--- 0000000000024000 008:00011 libm-2.5.so
    b7ed6000       4 rw--- 00000000b7ed6000 000:00000   [ anon ]
    b7ed7000      20 r-x-- 0000000000000000 008:00011 libcrypt-2.5.so
    b7edc000       8 rw--- 0000000000004000 008:00011 libcrypt-2.5.so
    b7ede000     156 rw--- 00000000b7ede000 000:00000   [ anon ]
    b7f05000       8 r-x-- 0000000000000000 008:00011 libdl-2.5.so
    b7f07000       8 rw--- 0000000000001000 008:00011 libdl-2.5.so
    b7f1f000      16 rw--- 00000000b7f1f000 000:00000   [ anon ]
    b7f23000     100 r-x-- 0000000000000000 008:00011 ld-2.5.so
    b7f3c000       8 rw--- 0000000000019000 008:00011 ld-2.5.so
    bffa4000      84 rw--- 00000000bffa4000 000:00000   [ stack ]
    ffffe000       4 r-x-- 0000000000000000 000:00000   [ anon ]
    mapped: 29852K    writeable/private: 27596K    shared: 0K

    And this is the output after garbage collection:

    13639:   ./trunk/ruby test.rb
    Address   Kbytes Mode  Offset           Device    Mapping
    08048000     712 r-x-- 0000000000000000 008:00006 ruby
    080fa000       4 rw--- 00000000000b2000 008:00006 ruby
    080fb000   18940 rw--- 00000000080fb000 000:00000   [ anon ]
    b714b000    7160 rw--- 00000000b714b000 000:00000   [ anon ]
    b7849000    2104 rw--- 00000000b7bdd000 000:00000   [ anon ]
    b7c45000    1188 rw--- 00000000b7c45000 000:00000   [ anon ]
    b7d6e000    1260 r-x-- 0000000000000000 008:00011 libc-2.5.so
    b7ea9000       4 r---- 000000000013b000 008:00011 libc-2.5.so
    b7eaa000       8 rw--- 000000000013c000 008:00011 libc-2.5.so
    b7eac000      12 rw--- 00000000b7eac000 000:00000   [ anon ]
    b7eaf000     148 r-x-- 0000000000000000 008:00011 libm-2.5.so
    b7ed4000       8 rw--- 0000000000024000 008:00011 libm-2.5.so
    b7ed6000       4 rw--- 00000000b7ed6000 000:00000   [ anon ]
    b7ed7000      20 r-x-- 0000000000000000 008:00011 libcrypt-2.5.so
    b7edc000       8 rw--- 0000000000004000 008:00011 libcrypt-2.5.so
    b7ede000     156 rw--- 00000000b7ede000 000:00000   [ anon ]
    b7f05000       8 r-x-- 0000000000000000 008:00011 libdl-2.5.so
    b7f07000       8 rw--- 0000000000001000 008:00011 libdl-2.5.so
    b7f1f000      16 rw--- 00000000b7f1f000 000:00000   [ anon ]
    b7f23000     100 r-x-- 0000000000000000 008:00011 ld-2.5.so
    b7f3c000       8 rw--- 0000000000019000 008:00011 ld-2.5.so
    bffa4000      84 rw--- 00000000bffa4000 000:00000   [ stack ]
    ffffe000       4 r-x-- 0000000000000000 000:00000   [ anon ]
    mapped: 31964K    writeable/private: 29708K    shared: 0K
  3. Andrew said,

    July 25, 2007 @ 3:04 pm

    I don’t have any better ideas, but I wish you the best of luck Hongli. Good work.

  4. Jamie Flournoy said,

    July 25, 2007 @ 4:57 pm

    This is really exciting stuff! Keep up the good work.

  5. Giles Morant said,

    July 27, 2007 @ 9:50 am

    Can you give a run-down of why 24 bytes are required for each entry? Could compression (of any sort) be of use?

    If malloc is the primary driver of this, how about chunking the table, much like Ruby does: e.g. the first time an entry is required, 100kb is put aside for it; the next chunk could be double etc..

    Otherwise, this is a problem which must have been overcome before- what does Perl/Python/Java (JRuby?) do??

    This is really interesting stuff- Ruby/Mongrel is quite intensive on memory, and this could be a winner.

  6. Hongli said,

    July 27, 2007 @ 3:45 pm

    Because a hash table entry is defined as follows:

    struct st_table_entry {
        unsigned int hash;
        st_data_t key;
        st_data_t record;
        st_table_entry *next;
    };
    

    Those are 4 fields, 4 bytes each, so the size of the struct is 16 bytes. The remaining 8 bytes are malloc() memory management overhead. Since the marking table only needs a set (and not a full hash table), one can save memory by removing the ‘record’ (value) field. The ‘hash’ field also seems redundant. So by modifying st.c I can save 8 bytes per entry, though porting Google sparse_hash is probably even more efficient. And yes, malloc overhead can be solved by using a memory pool.

    Perl and Python use reference counting, so they are not affected by this problem. (Though instead they have other problems, like circular references.) The Sun JVM implementation uses generational garbage collection (a modified form of mark-and-sweep) as far as I know. I’m guessing JRuby uses JVM’s garbage collection.
    Java is not at all affected by this problem. I’m not sure how copy-on-write friendly the Sun JVM is, but Java apps cannot fork and don’t need fork, because Java apps cannot corrupt memory and because Java has excellent threading support. When a Java thread crashes, the others are unaffected.

  7. Hongli said,

    July 27, 2007 @ 8:50 pm

    I’m baffled. I implemented a lighweight set structure based on st.c, but the memory saving is only 500 KB instead of the expected 3.8 MB. :(

RSS feed for comments on this post · TrackBack URI

Leave a Comment