Making Ruby’s garbage collector copy-on-write friendly, part 3
I wrote my own memory pool implementation for allocating entries in the pointer set/marking table. I also modified the garbage collector’s sweep routine to write flags only when their values actually need to change, so that the sweep doesn’t mark pages as dirty unnecessarily. The garbage collector now also resets the marking table after a sweep, that is, it deallocates all memory used by the marking table. Finally, I modified mongrel_light_cluster to perform a forced garbage collection just before forking. That prevents some objects from being garbage collected after forking (which would mark pages as dirty).
My work is finally bearing fruit.
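The sweep change can be illustrated with a toy model in Ruby. This is only a sketch, not the C code from the patch: `Slot` and `sweep_flags` are hypothetical names, and the write counter stands in for memory pages being touched.

```ruby
# Toy model of "only write a flag when its value actually changes".
# Slot and sweep_flags are hypothetical names, not taken from the patch.
Slot = Struct.new(:marked)

# Clears the mark on every slot and returns the number of writes performed.
# With only_if_needed set, slots that are already unmarked are skipped,
# which in a real collector means their memory page is not written to.
def sweep_flags(slots, only_if_needed)
  writes = 0
  slots.each do |slot|
    next if only_if_needed && !slot.marked
    slot.marked = false
    writes += 1
  end
  writes
end

slots = Array.new(1000) { |i| Slot.new(i % 10 == 0) } # 100 marked, 900 not
naive   = sweep_flags(slots.map(&:dup), false)
careful = sweep_flags(slots.map(&:dup), true)
puts "naive writes: #{naive}, careful writes: #{careful}"
```

Both variants leave every slot unmarked; the difference is only in how many slots get written to, which is what matters for copy-on-write.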
If I run the following test script in standard Ruby
    GC.disable
    list = []
    500000.times do
      list << "foo"
    end

    pid = fork do
      pid = fork do
        puts "forked"
        STDIN.readline

        GC.enable
        if ObjectSpace.respond_to?(:garbage_collect!)
          ObjectSpace.garbage_collect!
        else
          ObjectSpace.garbage_collect
        end
        puts "garbage collected"
        STDIN.readline
        exit!
      end
      Process.waitpid(pid)
      exit!
    end
    Process.waitpid(pid)
then the writable memory after “garbage collected” is printed is 14.5 MB. When run in my patched Ruby, the writable memory is 320 KB!
According to some benchmarks, performance has improved by 15% compared to the last implementation (which used the system’s malloc()/free()).
Download the Ruby 1.8.6 patch here.
Technical details
The memory pool contains multiple memory chunks. Each memory chunk has a fixed size and cannot grow. If the memory pool notices that a memory chunk is full, it allocates a new one. When an item is deallocated, any chunk that has become empty is freed. The memory pool has 4 bytes (1 pointer) of overhead for every allocated item. Allocation time is O(1) and deallocation time is O(n), where n is the number of chunks (a chunk contains 4096 items by default). Deallocation could probably be made faster if the garbage collector could tell the marking table to build an index of chunks.
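To make the chunk behaviour concrete, here is a minimal Ruby model of such a pool. It is a sketch under assumptions, not the implementation from the patch (which is written in C): `ChunkPool`, `chunk_count` and the integer "addresses" are all hypothetical, and the list of chunks with free slots is just one way to get O(1) allocation.

```ruby
# Toy model of a chunked memory pool. ChunkPool and its methods are
# hypothetical names; integers stand in for real memory addresses.
class ChunkPool
  Chunk = Struct.new(:base, :free, :used)

  def initialize(items_per_chunk = 4096)
    @items_per_chunk = items_per_chunk
    @chunks = []     # every live chunk
    @available = []  # chunks that still have at least one free slot
    @next_base = 0   # stand-in for the address of the next new chunk
  end

  def chunk_count
    @chunks.size
  end

  # O(1): take a slot from a chunk known to have room, or create a new chunk.
  def allocate
    chunk = @available.last || add_chunk
    index = chunk.free.pop
    chunk.used += 1
    @available.pop if chunk.free.empty? # chunk is full now
    chunk.base + index
  end

  # O(n) in the number of chunks: the owner is found by scanning ranges.
  def deallocate(address)
    chunk = @chunks.find do |c|
      address >= c.base && address < c.base + @items_per_chunk
    end
    raise ArgumentError, "address not owned by this pool" unless chunk
    @available.push(chunk) if chunk.free.empty? # it was full until now
    chunk.free.push(address - chunk.base)
    chunk.used -= 1
    return unless chunk.used.zero?
    # The chunk is empty again, so it is released.
    @chunks.delete_if { |c| c.equal?(chunk) }
    @available.delete_if { |c| c.equal?(chunk) }
  end

  private

  def add_chunk
    free_slots = (0...@items_per_chunk).to_a.reverse
    chunk = Chunk.new(@next_base, free_slots, 0)
    @next_base += @items_per_chunk
    @chunks << chunk
    @available << chunk
    chunk
  end
end

pool = ChunkPool.new(4)                     # tiny chunks for demonstration
addresses = Array.new(6) { pool.allocate }  # six items need two chunks
puts pool.chunk_count
addresses.last(2).each { |a| pool.deallocate(a) }
puts pool.chunk_count                       # the second chunk is gone again
```

The linear scan in `deallocate` is where the O(n) cost comes from; an index of chunks, as suggested above, would replace that scan with a lookup.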
For those who don’t know, there are two ways that the standard C library’s malloc() can allocate memory:
- Allocate from an internal memory pool. If you free() this memory, it is not released back to the operating system. This is the case when malloc() is asked to allocate fewer than 76 bytes.
- Allocate with mmap(). This is the case when malloc() is asked to allocate 76 bytes or more. If you free() such memory, it *is* released back to the operating system.
(I’m assuming a GNU/Linux system with glibc, on x86. I don’t know how C libraries on other operating systems or architectures behave.)
The pointer set contains many 8-byte entries, so if I allocate each one with malloc(), none of that memory is ever released back to the operating system, even after I free the entries. By using a memory pool, which allocates big chunks of memory at once, I force malloc() to use mmap(). Therefore, all memory allocated for the pointer set entries can be released to the operating system. The memory used by the bucket array is not affected by this: the bucket array is usually quite large, so malloc() already uses mmap() to allocate it.
We’re (still) not done yet!
Consider this second test script:
    GC.disable
    require 'rubygems'
    gem 'rails'
    require 'set'
    require 'pathname'
    require 'cgi'
    require 'initializer'

    GC.enable
    ObjectSpace.garbage_collect
    GC.disable
    pid = fork do
      pid = fork do
        puts "forked"
        STDIN.readline

        GC.enable
        if ObjectSpace.respond_to?(:garbage_collect!)
          ObjectSpace.garbage_collect!
        else
          ObjectSpace.garbage_collect
        end
        puts "garbage collected"
        STDIN.readline
        exit!
      end
      Process.waitpid(pid)
      exit!
    end
    Process.waitpid(pid)
It loads some code, then forks and runs garbage collection. When run in standard Ruby, the final writable memory is 4.7 MB. When run in my patched Ruby, the final writable memory is 4.1 MB. While this is an improvement, it’s not much. Most of the memory occupied by the first test script is in strings. In this second test script, most of the memory is occupied by code (i.e. the AST). The garbage collector seems to write to AST nodes even when it’s not necessary.
I also tested mongrel_light_cluster. After mongrel_light_cluster has forked, each child has about 15 MB of writable memory. After loading some pages, their writable memory jumps to 17 MB. Compared to the 26 MB used by regular mongrel_cluster, this certainly is an improvement. But I’m still not satisfied. I’m pretty sure this can be optimized further. Somewhere, somehow, pages are being marked as dirty. Finding them (and finding a solution for them) will be hard.
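One possible starting point for that hunt, assuming a Linux system: the kernel reports a `Private_Dirty` figure for every memory mapping of a process in `/proc/<pid>/smaps`. The sketch below sums those figures per mapping. `dirty_by_mapping` is a hypothetical helper, not part of the patch, and this is not necessarily how the numbers in this post were measured.

```ruby
# Sums the Private_Dirty figures (in kB) per mapping in smaps-formatted text.
# dirty_by_mapping is a hypothetical helper, not part of the patch.
def dirty_by_mapping(smaps_text)
  totals = Hash.new(0)
  current = nil
  smaps_text.each_line do |line|
    if line =~ /\A[0-9a-f]+-[0-9a-f]+\s/
      # Mapping header: address range, perms, offset, device, inode and an
      # optional path. Mappings without a path are grouped as anonymous.
      current = line.split[5] || "[anonymous]"
    elsif current && line =~ /\APrivate_Dirty:\s+(\d+) kB/
      totals[current] += $1.to_i
    end
  end
  totals
end

# Print the five mappings of the current process with the most dirty memory.
# This only does something where /proc/<pid>/smaps exists (i.e. on Linux).
path = "/proc/#{Process.pid}/smaps"
if File.readable?(path)
  top = dirty_by_mapping(File.read(path)).sort_by { |_, kb| -kb }.first(5)
  top.each { |name, kb| puts "#{kb} kB  #{name}" }
end
```

Running this in a forked child before and after a garbage collection, and comparing the two outputs, would at least show which mappings the extra dirty pages belong to.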
