Making Ruby’s garbage collector copy-on-write friendly, part 3

I wrote my own memory pool implementation, for allocating entries in the pointer set/marking table. I also modified to garbage collector’s sweep routine to only modify some flags if they need to be modified (so that that part doesn’t mark pages unnecessarily as dirty). The garbage collector will also reset the marking table after a sweep – that is, deallocate all memory used by the marking table. Finally, I modified mongrel_light_cluster to perform a forced garbage collection just before forking. That will prevent some things to be garbage collected after forking (which will mark pages dirty).

My work is finally showing some fruits. :) If I run the following test script in standard Ruby

Ruby
  1. GC.disable
  2. list = []
  3. 500000.times do
  4.         list < < "foo"
  5. end
  6.  
  7. pid = fork do
  8.         pid = fork do
  9.                 puts "forked"
  10.                 STDIN.readline
  11.  
  12.                 GC.enable
  13.                 if ObjectSpace.respond_to?(:garbage_collect!)
  14.                         ObjectSpace.garbage_collect!
  15.                 else
  16.                         ObjectSpace.garbage_collect
  17.                 end
  18.                 puts "garbage collected"
  19.                 STDIN.readline
  20.                 exit!
  21.         end
  22.         Process.waitpid(pid)
  23.         exit!
  24. end
  25. Process.waitpid(pid)

then the writable memory after “garbage collected” is printed, is 14.5 MB. When run in my patched Ruby, the writable memory is 320 KB! :) According to some benchmarks, performance has improved by 15% compared to the last implementation (which used the system’s malloc()/free()).

Download the Ruby 1.8.6 patch here.

Technical details

The memory pool contains multiple memory chunks. Each memory chunk has a fixed size, and cannot grow. If the memory pool notices that a memory chunk is full, it will allocate a new one. If an item is deallocation, empty memory chunks are freed. The memory pool has 4 bytes (1 pointer) overhead for every allocated memory item. Allocation time is O(1) and deallocation time is O(n), where n is the number of chunks (a chunk contains 4096 items by default). This can probably be faster if the garbage collector can tell the marking table to build an index of chunks.

For those who don’t know, there are two ways that the standard C library’s malloc() can allocate memory:

  1. Allocate in some internal memory pool. However, if you free() this memory, then that memory is not released to the operating system. This is the case if malloc() is given the command to allocate objects less than 76 bytes.
  2. Allocate with mmap(). This is the case if malloc() is given the command to allocate objects of 76 bytes or more. If you free() such memory, then that memory *is* released back to the operating system.

(I’m assuming a GNU/Linux system with glibc, on x86. I don’t know how C libraries on other operating systems or architectures behave.)

The pointer set contains many 8-byte entries, so if I use malloc() then all that memory is never released back to the operating system even if I deallocate them. By using a memory pool, which allocates big chunks of memory at once, I force malloc to use mmap(). Therefore, all memory allocated for the pointer set entries can be released to the operating system. The memory used by the bucket array is not affected by this, since the bucket array is usually quite large, so malloc() already uses mmap() to allocate memory for the bucket array.

We're (still) not done yet!

Consider this second test script:

Ruby
  1. GC.disable
  2. require 'rubygems'
  3. gem 'rails'
  4. require 'set'
  5. require 'pathname'
  6. require 'cgi'
  7. require 'initializer'
  8.  
  9. GC.enable
  10. ObjectSpace.garbage_collect
  11. GC.disable
  12. pid = fork do
  13.         pid = fork do
  14.                 puts "forked"
  15.                 STDIN.readline
  16.  
  17.                 GC.enable
  18.                 if ObjectSpace.respond_to?(:garbage_collect!)
  19.                         ObjectSpace.garbage_collect!
  20.                 else
  21.                         ObjectSpace.garbage_collect
  22.                 end
  23.                 puts "garbage collected"
  24.                 STDIN.readline
  25.                 exit!
  26.         end
  27.         Process.waitpid(pid)
  28.         exit!
  29. end
  30. Process.waitpid(pid)

It loads some code, then forks and runs garbage collection. When run in standard Ruby, then the final writable memory is 4.7 MB. When run in my patched Ruby, the final writable memory is 4.1 MB. While this is an improvement, it's not much. Most of the memory occupied by the first test script is in strings. In this second test script, most of the memory is occupied by code (i.e. the AST). The garbage collector seems to write to AST nodes even when it's not necessary.

I also tested mongrel_light_cluster. After mongrel_light_cluster has forked, each child has about 15 MB of writable memory. After loading some pages, their writable memory jump to 17 MB. Compared to the 26 MB used by regular mongrel_cluster, this certainly is an improvement. But I'm still not satisfied. I'm pretty sure this can be optimized further. Somewhere, somehow, pages are being marked as dirty. Finding them (and finding a solution for them) will be hard.

8 Comments »

  1. Kevin said,

    October 5, 2007 @ 6:12 pm

    Any more progress on this? I love the work you’re doing. The fact that the ruby core team hasn’t looked at how it behaves under fork() really pisses me off. There are lots of rails processes out there that would run much faster on GBs less of ram if this wasted copy-on-writing was fixed…

  2. Ben Osheroff said,

    October 10, 2007 @ 2:31 am

    Hi, ruby loves to play around with objects on the heap. In my work with getting copy-on-write efficiency, I found that I needed to
    sort the nodes into two groups: those likely to tinkered with later by ruby, and those not). So, I created two types of heaps: immutable and mutable. immutable heaps are filled by ruby until you fork(), at which point ruby tries not to touch them, so as not to dirty the vm page.

    This patch also has a concept of a separate mark table along with immutable/mutable heaps. It’s against 1.8.5 (with some other patches applied, I forget quite what), but it should give you the general idea.

    http://www.gimbo.net/changeset_r7142.diff

  3. JJV said,

    October 11, 2007 @ 3:22 am

    Hi,

    I’ve been looking for a way to reduce mongrel memory usage and came across your blog.

    Would like to know your current status on this project and if you know of anyone that has tried implementing your mongrel_light_cluster and how that has worked for them.

    Thanks.

  4. Hongli said,

    October 11, 2007 @ 10:26 am

    I haven’t worked on this since July. I also tried to do the marking in a separate process in order not to touch the heap of the parent process, but performance degraded literally by a thousand fold (probably because of context switch overhead) so I gave up.

    I’ve tested mongrel_light_cluster + patched Ruby on a small website, and so far it seems to work well (i.e. no stability problems). I’ve seen mongrel_light_cluster fail on Mephisto, and I later found out that it’s because mongrel_light_cluster doesn’t properly support RoR’s config.load_paths. That’s something I should fix some day.

    Ben: your approach is very interesting. Your code doesn’t seem to be compatible with mine, but perhaps I could use some of your ideas. :)

  5. Dan42 said,

    October 12, 2007 @ 4:47 am

    This is all very very interesting. I’m glad I’m not the only one with concerns about ruby’s GC memory performance (or lack thereof). Fortunately, this had been fixed in ruby 1.9. Unfortunately, I’m not sure rails will ever be compatible with 1.9.

    Have you considered using a large bitfield instead of a hash? If you have 3 chunks with n1, n2 and n3 objects you could allocate a (n1+n2+n3)-bit field (only 1 memory allocation!) and use that instead of FL_MARK. I don’t think you can beat 1 bit per object.

  6. Hongli said,

    October 12, 2007 @ 10:12 am

    Dan42, you’re brilliant. Why haven’t I thought of that. :D I might be able to put a marker bit field in each Ruby heap and use that as mark table instead. I’ll give it a try when I have time.

  7. 赖洪礼的 blog » Making Ruby’s garbage collector copy-on-write friendly, part 4 said,

    October 13, 2007 @ 9:54 pm

    [...] previous attempt implements the marking table with a set structure. In part 3, Dan42 said that I could use a bit field instead. The idea did occur to me in the past, but I [...]

  8. Why mod_rails is a really good thing for light-duty Ruby on Rails at Pervasive Code said,

    April 14, 2008 @ 9:30 pm

    [...] interesting alternative was some experimental hacking to MRI’s garbage collector by Hongli Lai, to store its working data separately from the objects being examined, so that preloaded Ruby code [...]

RSS feed for comments on this post · TrackBack URI

Leave a Comment