Archive for Optimization

Optimizing RDoc

When it comes to documentation, there’s always room for improvement. I’ve been contributing to docrails for quite a while now. However, one issue that has always annoyed me is how slow RDoc is.

In particular, generating RDoc output for ActiveRecord is painfully slow. It takes 34.2 seconds on my machine. Using JRuby helps a little, seeing that it can do JIT and all, but not much: it still takes 29 seconds (including JVM startup time). I had hoped for a bigger performance improvement.

So a few days ago I started optimizing RDoc. And not without result. Here are the results of generating RDoc for the Ruby 1.8 standard library with RDoc running in MRI 1.8:
Before optimizing: 1624.8 seconds (27 minutes). I had to close all applications because RDoc ate almost 1.5 GB of memory.
After optimizing: 531.2 seconds (8 minutes and 51 seconds). RDoc now uses only about half the memory it used to need.
Performance improvement: 206%!

rdoc_generation_time_for_ruby_1_8_standard_library.png

The results for generating RDoc for ActiveRecord are:

Before optimizing (MRI 1.8): 34.2 seconds
After optimizing (MRI 1.8): 29.9 seconds
Performance improvement: 14%

Before optimizing (MRI 1.9): 16.8 seconds
After optimizing (MRI 1.9): 13.2 seconds
Performance improvement: 27%

Before optimizing (JRuby): 29.0 seconds
After optimizing (JRuby): 24.5 seconds
Performance improvement: 18%

rdoc_generation_time_for_activerecord.png

So what did I do? Read on!

Multi-threaded parsing

My laptop has a dual core CPU. It’s kind of a waste to see RDoc utilizing only a single core. RDoc can be parallelized by running 2 rdoc processes at the same time. But I’m never generating 2 different sets of RDoc documentation at the same time, so running 2 rdoc processes in parallel won’t do me any good. So I had to search for ways to parallelize a single rdoc process.

It turns out that RDoc’s code parsing phase is pretty easy to parallelize. It just creates a new parser object for every input file, and parses files sequentially. So I modified RDoc to create multiple worker threads during the parsing phase. The main thread will offer a list of filenames, and the worker threads will each consume filenames and parse the corresponding file as fast as they can.

What I did not try to do however, is making the parser itself multi-threaded. That’s just asking for problems. Those of you who are interested in multi-threaded programming, but aren’t experienced with it, should keep this in mind: try to keep your threading code as simple as possible, and make sure that your threads share as little data as possible. In my case, the only things that the threads share are the filename queue and the result aggregation array.

MRI 1.8 implements userspace threads, so having multiple threads doesn’t increase CPU core utilization. MRI 1.9 uses kernel threads, but it has a global interpreter lock, so it can’t utilize multiple CPU cores either. Luckily, JRuby doesn’t have a global interpreter lock. The parsing phase is now about 35% faster, when run on JRuby on my dual core machine.

This patch has been submitted upstream: http://rubyforge.org/tracker/index.php?func=detail&aid=22555&group_id=627&atid=2474

Cache template pages

For larger projects, the slowest phase of RDoc is probably the HTML generation phase. The latest version of RDoc uses ERB templates for HTML output. RDoc generates an HTML file for every class, every module and every source file. However after some profiling with ruby-prof, I found out that RDoc recompiles the ERB template for every output file, which is slow.

I modified RDoc so that it caches compiled ERB templates. This made it about 19% faster.

This patch has been submitted upstream: http://rubyforge.org/tracker/index.php?func=detail&aid=22556&group_id=627&atid=2474

Reduce the number of relative URL generations

For every output file, RDoc generates a list of URLs to this output file, relative to the output filenames of all classes, files and methods. This is done by calling the function index_to_links, which runs in O(n) time. For every output file, index_to_links is called 3 times, each time with the full list of classes, files and methods. Suppose there are K output files, L classes, N files and M methods. Then this will result in K * (L+N+M) operations, which is quadratic! Ouch!

The RDoc authors already tried to optimize this by only calling index_to_links when the current output filename’s directory is different from the last one. For example, if the last output file was classes/active_support/cache/memcache_store.html and the current output file is classes/active_support/cache/memory_store.html, then index_to_links doesn’t need to be called again, and its last result can be reused. The problem is that the output filename list isn’t sorted, and so a directory change is detected much more often that it needs to be.

I optimized this by sorting the list. This resulted in an 8% performance improvement.

This patch has been submitted upstream: http://rubyforge.org/tracker/index.php?func=detail&aid=22565&group_id=627&atid=2474

Reduce garbage and redundant operations in index_to_links

But the story doesn’t end there. index_to_links accepts two arguments: a filename and an array. index_to_links sorts its input array every time it is called. Remember that I said that index_to_links is O(n)? It’s not: it’s actually O(n log(n)) because of the sorting.

The arrays that are passed to index_to_links remains the same every time. So it’s very inefficient to keep sorting them over and over. I optimized this by:

  1. sorting the arrays only once.
  2. passing the sorted arrays to index_to_links.
  3. modifying index_to_links to not perform any sorting.
  4. modifying index_to_links to use in-place array manipulation methods as much as possible, to avoid generating unnecessary garbage objects.

The result is a 14% overall performance improvement.

This patch has been submitted upstream: http://rubyforge.org/tracker/index.php?func=detail&aid=22557&group_id=627&atid=2474

Failed: parser output caching

Seeing that the parser isn’t very fast, I thought about caching its output. So I modified RDoc in the following manner:

  1. It Marshal.dumps the parser output to a cache file.
  2. It consults the cache file whenever possible, instead of parsing the input file.

This made the parsing phase about 20% slower on a cold run, but 75% faster on a warm run.

Unfortunately, this approach totally fails on large projects. When run on the Ruby 1.8 standard library, it results in cache files that are 6 MB or larger. Loading such a file is much slower than parsing the original source file, and uses more memory too. My laptop crashed during this experiment because Ruby used 2 GB of memory. So I abandoned this effort.

Final words

Take the performance improvement numbers that I’ve given with a grain of salt. These numbers were taken by running RDoc on its own sources, under MRI 1.8. The performance gain really depends on the input. You’ve already seen the difference in performance improvement between running RDoc on the Ruby 1.8 standard library, and running it on ActiveRecord.

Comments (7)

Making Ruby’s garbage collector copy-on-write friendly, part 8

Hi folks, it has been a while since the last “Making Ruby’s garbage collector copy-on-write friendly” post. Many things have happened in the mean time, and my copy-on-write work is now usable (and used) in production environments, but it seems that there is still confusion. So I’ve decided to write a new post which explains the situation.

Copy-on-write updates

In March I submitted my work to the Ruby core mailing list. There has been some discussion. As a result, various people, including myself, have made a number of improvements.

The improvements are as follows:

  • The copy-on-write friendly garbage collector is now a few % faster thanks to various micro-optimizations.
  • The mark table implementation is now pluggable.

    On Windows, a copy-on-write friendly garbage collector is totally useless because fork() is not supported on Windows. Furthermore, not all Ruby applications call fork(). So I’ve made two mark table implementations: one based on the old one (which marks objects directly by setting a flag on the object) and a copy-on-write friendly one. It is now possible to change the mark table implementation during runtime by calling GC.copy_on_write_friendly = (boolean value).

    This has huge performance implications. The copy-on-write friendly mark table makes the garbage collector about 0%-20% slower, depending on the application and the workload. However, the non-copy-on-write friendly mark table is enabled by default, so by default there is only a 1% performance penalty. This performance penalty comes from the fact that marking an object now requires a function call which sets the mark flag, instead of setting the mark flag directly. But I think 1% is acceptable.

  • Various little bugs in the debugging code have been fixed.
  • Me and Ninh are working on a scientific paper regarding the copy-on-write work.

Unfortunately the discussion stranded. Matz had some concerns about performance, which is why I made the mark table implementation pluggable. I will re-submit the patch for further evaluation when the time is right.

Ruby Enterprise Edition

Many of you have probably heard of Ruby Enterprise Edition. There has been, and still is, a lot of fuss about the name. But that’s intentional and is all part of the plan — if people make a fuss about the name then it means we’re not in the Zone of Mediocrity. :)

What is Ruby Enterprise Edition? People thought it’s a closed source product, but in fact the website’s front page and download page has the following huge sticker:

(We actually added this sticker after we’ve seen that people think it’s going to be closed source.)

In one sentence:
Ruby Enterprise Edition is an easy to install Ruby interpreter that includes, among other things, my copy-on-write work.

Facts and myths:

  • It’s open source, not closed source. It’s freely available to all.
  • It’s not an entirely new Ruby implementation. It’s based on the official Ruby interpreter (MRI), version 1.8.6-p286. This means that all your existing Ruby applications are compatible with Ruby Enterprise Edition.
  • It does not only include my copy-on-write work. There’s more. Read on.
  • It is not a hostile fork, but a friendly one. The work included in Ruby Enterprise Edition is meant to be merged back to upstream at some point in the future.
  • The copy-on-write work has been submitted to the Ruby core team in the past.
  • Phusion Passenger is very well-integrated with Ruby Enterprise Edition. If you use Phusion Passenger in combination with Ruby Enterprise Edition, then your Rails applications will transparently use 33% less memory and will be faster, as if it’s magic. You don’t need to do anything special, it just works.

    The only condition is that you must not be using conservative spawning in your application. But if you don’t know what conservative spawning is then you’re not using it, and you’ll have nothing to worry about.

Why was Ruby Enterprise Edition made?

Consider the following facts:

  • My copy-on-write work can potentially save a lot of memory in Rails applications.
  • The patch has been submitted to upstream, but hasn’t been accepted yet.
  • There is a demand for lower memory usage in Rails applications, right now, not X months/years in the future.

Given the circumstances, and to satisfy the demands (including that of ourselves), we have decided that it would be best to maintain our own Ruby fork which includes these patches.

You might be wondering: Why not just release the patch? Why create a fork?

The answer is user friendliness. Telling people to download Ruby’s source code and apply a patch is not user friendly. In fact, to many people, it’s downright scary. Imagine that you want a transparent and easy way to make your Rails applications “magically” use 33% less memory. Which of the following instructions would you prefer?

Use Phusion Passenger to deploy your application. Then download the Ruby interpreter source code from www.ruby-lang.org. Download it and extract the tarball. Then, download this patch and apply it with this and that command. Then, run ‘./configure –prefix=/somewhere’. Make sure that /somewhere is not /usr in order to prevent overwriting your old Ruby installation, you don’t want that to happen. Then type ‘make’, and then ‘sudo make install’. Then download RubyGems, extract it, and type ‘sudo /somewhere/bin/ruby setup.rb’ in the RubyGems source folder. Then type ‘/somewhere/bin/gem install rails’ to install Ruby on Rails and whatever other gems you might need.

or:

Use Phusion Passenger to deploy your application. Then download Ruby Enterprise Edition. Run the installer and follow the instructions. Done.

The first one contains a lot of caveats. Many many things can go wrong. Many many people aren’t experienced in installing Ruby from source. It’s just easier if there’s a vendor that takes care of everything for you. And we are that vendor.

We want Phusion Passenger and everything surrounding it to have a “just works” experience.

So if it’s not just the copy-on-write work, then what else does Ruby Enterprise Edition include?

  • This one is huge: by using Google’s tcmalloc, an alternative memory allocator, Ruby becomes 20% faster even with the copy-on-write friendly garbage collector! Furthermore, tcmalloc seems to be more copy-on-write friendly than ptmalloc2, Linux’s default memory allocator, so by using tcmalloc we can save even more memory!
    We discovered this shortly after submitting the patch to the Ruby core mailing list. So Ruby Enterprise Edition also includes tcmalloc.
  • Ruby Enterprise Edition includes an easy-to-use installer which takes care of installing tcmalloc, Ruby, RubyGems and important/useful gems for you. It also teaches you how to tell Phusion Passenger to use Ruby Enterprise Edition instead of normal Ruby.
  • In the future we might include more patches that might be useful in production environments.

Who’s already using Ruby Enterprise Edition?

I’m not sure because we’ve never asked our users. But the Ruby on Rails Wiki is running on it, and it has been great. I’ve been monitoring the Wiki for a while now, and ever since we’ve switched it to Phusion Passenger + Ruby Enterprise Edition, it has been rock-solid (before, it used to crash often). We also observed a great reduction in memory usage.

Michael Koziarski, a Rails core developer, runs Phusion Passenger with Ruby Enterprise Edition on his blog. He said that he downgraded his server because Phusion Passenger + Ruby Enterprise Edition saved him so much memory.

Final words

I hope this post has shed some light on matters. I’m just a little surprised that there’s all this confusion going on because all of this is also documented on the Ruby Enterprise Edition website’s FAQ. eustaquiorangel.com recently interviewed me and asked similar questions. You should check it out.

I’m also a little surprised that people seem to be reluctant about installing Ruby Enterprise Edition. If I have the choice between two products A and B, and B is the same as A but is much more efficient and is easy to install, then I’d choose B.
It is that people are suspicious about our claims? We’ve published a performance and memory usage comparison. Anybody can read this comparison, perform it himself, and check whether our claims are true. Everything we claim is verifiable so I don’t understand what there is to be suspicious about.

Please feel free to post your thoughts on this, I’d really like to hear what people have to say.

Comments (11)

Making PStore reaaaally fast (and stable)

PStore is a library in Ruby that “implements a file based persistance mechanism based on a Hash”. Ruby on Rails up until version 1.2 uses PStore as the default mechanism for storing sessions. But it’s pretty well-known that PStore “sucks”: people say that it’s slow, causes file corruptions, etc. This is one of the reasons why Rails 2.0 uses the cookie session store by default: it’s faster and needs less maintenance at the cost of a few (easy-to-avoid) caveats.

If you look at the Mongrel FAQ, then you’ll see that 3 items are devoted to telling you that PStore is the work of satan and that any sane web developer should ritually burn it. But this particular sentence caught my eye:

“Other things that can cause big pauses are:
- …
- Locking files wrong. Multiple processes locking files is a delicate thing to do. “

Locking files wrong? How can that possibly be? File locking, just like mutex locking, isn’t really rocket science.

Note that PStore’s RDoc advertises itself as transactional:

“# The transactional behavior ensures that any changes succeed or fail together.
# This can be used to ensure that the data store is not left in a transitory
# state, where some values were upated but others were not.”

I decided to take a look at its source code. What possibly could have gone wrong?

Well, I’m not sure what is wrong with it, but the relevant code, PStore#transaction, is a bit messy. I had a hard figuring out what it is exactly doing and why, but I figured that it does these things:

  • It locks the file.
  • It reads and unmarshals the file.
  • It generates an MD5 checksum of the file.
  • It runs the transaction block, then generates an MD5 checksum of the new contents of the file. The file is only written to if the MD5 checksum or the size of the new content doesn’t match that of the original file.
  • It writes new data to a temp file, then renames that to the original file (or at least, I think that’s what it does; it seems to use 2 temp files files). File renames are atomic on Unix but not Windows. This is why on Windows, one needs to implement file recovery as well. PStore seems to have some code for recovery, but it’s unclear how well that works. It doesn’t match the algorithm given on MSDN.

It’s not clear whether PStore has locked everything correctly. Oh, and PStore isn’t thread-safe (though it is reentrant).

  • PStore is usually used for writing small amounts of data, so calculating an MD5 is definitely not worth it – just write the file already!
  • Writing to a temp file ensures atomicity on Unix, but it adds another system call, and system calls are expensive. If one is able to open the file for writing-and-appending, then writing to it shouldn’t raise any errors except in rare conditions, such as out-of-disk-space conditions or hardware errors. Furthermore, I’ve never seen anybody using PStore for anything other than for session data and other not-so-important stuff, so performance is more important.
  • Aah, Windows…. All the recovery code just to work around the fact that Windows doesn’t support atomic file renames, cause a lot of performance loss.

So I rewrote PStore. The code is now easier to read and is faster. And as far as I know, everything is locked correctly so there shouldn’t be any concurrent issues.

By default, it doesn’t try as hard to ensure file integrity because harddisk I/O errors are very rare, but if file integrity is really an issue, then you can set pstore.ultra_safe_transactions = true to enable it. This option only has effect on Unix though: as of now, I haven’t bothered writing complex recovery code for Windows.

This is the benchmark program that I used:
http://pastie.org/170997

The results are as follows:
benchmark.png

As you can see, PStore has become 2 times of even 3 times faster, depending on whether is ultra_safe_transactions is enabled! But let’s see what kind of effect it has on a dummy Rails 2.0 app that uses the PStore session store:

  • Dummy Rails 2.0 app, using original PStore library: 170.22 requests/sec
  • Dummy Rails 2.0 app, using my PStore library: 196.72 requests/sec

A small performance increase. :) (Though the cookie session store is still faster: 221.74 requests/sec.)

You can download it here:
http://pastebin.com/f163702f2

If you want your Rails app to make use of it, simply save it as “lib/pstore.rb”. For maximum stability, Rails’s PStore handler should be modified to ignore any errors encountered during unmarshalling.

One of these days I’ll send a patch to ruby-core so that this can be merged back upstream. But for now Passenger (a.k.a. mod_rails) has priority.

Comments (14)

Making Ruby’s garbage collector copy-on-write friendly, part 7

It turns out that part 6 was not final. I’ve been working on the garbage collector again. These are the changes:

  • The garbage collector is now 20% faster! For every object to be marked or sweeped, the garbage collector has to find the heap that an object belongs to. Empirical evidence shows that the chance is very high that the current object belongs to the same heap as the last object’s heap. So caching the last lookup result improved performance dramatically.
  • It is now possible to enable debugging of the GC at runtime, using environment variables.
    Set RUBY_GC_DEBUG=1 to enable GC debugging. It will print a message to stderr every time the garbage collector is started.
  • If GC debugging is enabled or if the environment variable RUBY_GC_ALLOC_HEAP_WITH_FILE is set, it will use mmap() on /dev/zero to allocate the Ruby heaps. This makes it possible to see exactly which heap has been made dirty, and by how much. I’m going to use this for further investigations.
  • A new method, GC.cow_friendly?, has been added. This allows apps to dynamically determine whether the garbage collector is copy-on-write friendly.

You can download the Ruby 1.8.6 patch here.

In addition, I’ve put my SVN repository online. It’s here:

http://public.railsplugins.net/repos/ruby/

The ‘vendor/’ folder contains the original Ruby source.
The ‘tags/’ folder marks important milestones in my development progress. In chronological order, they are:

  • mark-table-1: the first milestone. See my past blog posts for details.
  • mark-table-2: the second milestone. See my past blog posts for details. If I remember correctly, this is where I started implementing memory pools.
  • mark-table-3: the third milestone. If I remember correctly, this is where I optimized the memory pools and performance.
  • bitfield-1: the first try using bit fields instead of sets.
  • bitfield-2: the second try using bit fields instead of sets. If I remember correctly, this contains more optimizations than the previous milestone.

The ‘trunk/’ folder is the latest version.

Comments (22)

Making Ruby’s garbage collector copy-on-write friendly, part 6 (final?)

I added some minor optimizations and cleaned up the code a little bit. I also fixed many bugs in mongrel_light_cluster. It’s time for another test round.

Test plan

The goal is to measure the memory usage of my Ruby GC patch + mongrel_light_cluster, compared to default Ruby and mongrel_cluster. The test plan is as follows:

  • Measure current memory usage.
  • Start 10 instances of mongrel_cluster.
  • Measure current memory usage.
  • Load the front page of all 10 mongrel_cluster instances 50 times.
  • Measure current memory usage.
  • Shutdown mongrel_cluster, and repeat the above steps with patched Ruby + mongrel_light_cluster.

Software used in this test

Ruby on Rails 1.2.5
Mongrel 1.0.1
mongrel_cluster 1.0.2
Mephisto 0.7.3

Environment: production
config.action_controller.perform_caching = false

Memory usage of 10 mongrel_cluster instances

Before mongrel_cluster is started:

             total       used       free     shared    buffers     cached
Mem:          1011        576        435          0         16        185
-/+ buffers/cache:        373        637
Swap:          996         13        982

After starting 10 instances of mongrel_light_cluster:

             total       used       free     shared    buffers     cached
Mem:          1011        861        149          0         16        185
-/+ buffers/cache:        659        352
Swap:          996         13        982

After downloading the front page 50 times from all instances:

             total       used       free     shared    buffers     cached
Mem:          1011        901        109          0         16        186
-/+ buffers/cache:        698        312
Swap:          996         13        982

Conclusion:
10 instances of mongrel_cluster will use 435-109 = 326 MB memory.
So mongrel_cluster uses 32.6 MB per instance.

Memory usage of 10 mongrel_light_cluster instances with patched Ruby

Before:

             total       used       free     shared    buffers     cached
Mem:          1011        587        424          0         18        193
-/+ buffers/cache:        375        635
Swap:          996         13        982

After starting 10 instances:

             total       used       free     shared    buffers     cached
Mem:          1011        752        259          0         18        193
-/+ buffers/cache:        540        470
Swap:          996         13        982

After downloading the front page 50 times from all instances:

             total       used       free     shared    buffers     cached
Mem:          1011        803        207          0         18        193
-/+ buffers/cache:        591        419
Swap:          996         13        982

Conclusion:
10 instances of mongrel_light_cluster uses 424-207 = 217 MB memory.
So 1 mongrel_light_cluster instance uses 21.7 MB.

Overall conclusion

Everybody likes graphs, so here is one:
copy-on-write-friendly-ruby-memory-usage.png
Memory usage of an instance went down from 32.6 MB to 21.7 MB. A 33% saving! :)
This seems to be the end of the road. I can’t seem to be able to add more optimizations.

You can download the latest Ruby 1.8.6 patch here.
You can install mongrel_light_cluster by typing:

svn checkout http://public.railsplugins.net/repos/mongrel_light_cluster/trunk mongrel_light_cluster
cd mongrel_light_cluster
rake
gem install pkg/mongrel_light_cluster-0.1.gem

Please read the README for further documentation.

Finally, I’d like to ask your help:

  • I want to turn this patch and mongrel_light_cluster into a real RubyForge project. Can anyone help me with the website and documentation?
  • Is there anyone willing to test my patch and mongrel_light_cluster? I’m not confident enough to use these on my business website (which handles payments) but I don’t have any other Rails sites that are large enough.

Please post here if you’re interested.

Comments (13)

Making Ruby’s garbage collector copy-on-write friendly, part 5

It seems that the garbage collector doesn’t make pages dirty after all, and that malloc() seems to be the problem.

I inserted the following test code at the beginning of the garbage_collect() function:

C
  1. if (debugging) {
  2.     printf("Before 1 KB is allocated.");
  3.     getchar();
  4.     calloc(1024, 1);
  5.     printf("After 1 KB is allocated.");
  6.     getchar();
  7. }

My Ruby test script looks like this:

Ruby
  1. load_ruby_on_rails
  2. ObjectSpace.garbage_collect    # garbage collect before forking to make sure stuff isn’t freed after forking
  3. pid = fork do
  4.     ObjectSpace.start_debugging    # sets the ‘debugging’ variable to true
  5.     ObjectSpace.garbage_collect
  6.     exit!
  7. end
  8. Process.waitpid(pid)

Between the messages “Before 1 KB is allocated.” and “After 1 KB is allocated.”, memory usage in the child process jumps from 125 KB to 7 MB!!!

I wrote a test program in C to see whether I can reproduce it outside Ruby:

C
  1. #include <stdio .h>
  2. #include < sys/types.h>
  3. #include < sys/wait.h>
  4. #include <unistd .h>
  5. #include <stdlib .h>
  6.  
  7. int
  8. main() {
  9.         int i;
  10.  
  11.         for (i = 0; i < 20000; i++) {
  12.                 // Allocate some stuff so that the heap isn’t empty.
  13.                 calloc(1024, 1);
  14.         }
  15.  
  16.         pid_t pid = fork();
  17.         if (pid == 0) {
  18.                 getchar();
  19.                 calloc(1024, 1);
  20.                 getchar();
  21.                 _exit(0);
  22.         } else {
  23.                 int status;
  24.                 waitpid(pid, &status, 0);
  25.         }
  26.         return 0;
  27. }

And I couldn’t. In the above test program, memory usage in the child process jumps from 30 KB to 40 KB.

Is there anyone with intimate knowledge about the Linux/glibc malloc() implementation who can tell me what’s going on?

Comments (2)

Making Ruby’s garbage collector copy-on-write friendly, part 4

My previous attempt implements the marking table with a set structure. In part 3, Dan42 said that I could use a bit field instead. The idea did occur to me in the past, but I thought it was infeasible because I thought the bit field will have to cover the entire process’s address space. After some thinking, I realized I was wrong. Thanks Dan42. :D

Each heap_slot struct now has a bit field, which is used as marking table for objects inside that heap:

C
  1. static struct heaps_slot {
  2.     void *membase;
  3.     RVALUE *slot;
  4.     int limit;
  5.     int *marks;        // added this
  6.     int marks_size;    // and this
  7. } *heaps;

The previous marking table implementation had a large overhead: about 8 bytes per marked object. An object (RVALUE) is 20 bytes, so the overhead is 40% in the worst case! And that’s not counting the overhead used by the set structure’s bucket.
The new bit field implementation has a much smaller overhead:
(n / 8 ) / (n * 20) = n / (8 * n * 20) = 1 / (8 * 20) = 1 / 160 = 0.00625
where n is the number of objects on a heap. The bit field only has 0.6% overhead!! :)

According to my benchmarks, garbage collection performance has improved by 60% compared to the previous implementation. It’s still 10%-20% slower than normal Ruby, but I think it’s acceptable.

I think a significant amount of time is spent on locating the heap (and thus the marking table bit field) that an object belongs to. Right now it performs a linear search on all heaps. That’s usually OK though. Each heap is 80% larger than the previous. Ruby on Rails fits in 6 heaps. I think the average application needs no more than 7 heaps. So in the average case, looking up the correct heap should take no more than 7 operations. But it still sucks, so next time I’ll investigate whether I can speed this up by sorting the heaps and using a binary search.

And I’m still not done. Running the garbage collector after a fork still makes a lot of pages dirty, and I still haven’t figured out why.

Anyway…

You can download the Ruby 1.8.6 patch here.

This is brought to you by blood, sweat and tears. I poked myself several times today wondering why my code crashes for mysterious reasons.

Comments (5)

Making Ruby’s garbage collector copy-on-write friendly, part 3

I wrote my own memory pool implementation, for allocating entries in the pointer set/marking table. I also modified to garbage collector’s sweep routine to only modify some flags if they need to be modified (so that that part doesn’t mark pages unnecessarily as dirty). The garbage collector will also reset the marking table after a sweep – that is, deallocate all memory used by the marking table. Finally, I modified mongrel_light_cluster to perform a forced garbage collection just before forking. That will prevent some things to be garbage collected after forking (which will mark pages dirty).

My work is finally showing some fruits. :) If I run the following test script in standard Ruby

Ruby
  1. GC.disable
  2. list = []
  3. 500000.times do
  4.         list < < "foo"
  5. end
  6.  
  7. pid = fork do
  8.         pid = fork do
  9.                 puts "forked"
  10.                 STDIN.readline
  11.  
  12.                 GC.enable
  13.                 if ObjectSpace.respond_to?(:garbage_collect!)
  14.                         ObjectSpace.garbage_collect!
  15.                 else
  16.                         ObjectSpace.garbage_collect
  17.                 end
  18.                 puts "garbage collected"
  19.                 STDIN.readline
  20.                 exit!
  21.         end
  22.         Process.waitpid(pid)
  23.         exit!
  24. end
  25. Process.waitpid(pid)

then the writable memory after “garbage collected” is printed, is 14.5 MB. When run in my patched Ruby, the writable memory is 320 KB! :) According to some benchmarks, performance has improved by 15% compared to the last implementation (which used the system’s malloc()/free()).

Download the Ruby 1.8.6 patch here.

Technical details

The memory pool contains multiple memory chunks. Each memory chunk has a fixed size, and cannot grow. If the memory pool notices that a memory chunk is full, it will allocate a new one. If an item is deallocation, empty memory chunks are freed. The memory pool has 4 bytes (1 pointer) overhead for every allocated memory item. Allocation time is O(1) and deallocation time is O(n), where n is the number of chunks (a chunk contains 4096 items by default). This can probably be faster if the garbage collector can tell the marking table to build an index of chunks.

For those who don’t know, there are two ways that the standard C library’s malloc() can allocate memory:

  1. Allocate in some internal memory pool. However, if you free() this memory, then that memory is not released to the operating system. This is the case if malloc() is given the command to allocate objects less than 76 bytes.
  2. Allocate with mmap(). This is the case if malloc() is given the command to allocate objects of 76 bytes or more. If you free() such memory, then that memory *is* released back to the operating system.

(I’m assuming a GNU/Linux system with glibc, on x86. I don’t know how C libraries on other operating systems or architectures behave.)

The pointer set contains many 8-byte entries, so if I use malloc() then all that memory is never released back to the operating system even if I deallocate them. By using a memory pool, which allocates big chunks of memory at once, I force malloc to use mmap(). Therefore, all memory allocated for the pointer set entries can be released to the operating system. The memory used by the bucket array is not affected by this, since the bucket array is usually quite large, so malloc() already uses mmap() to allocate memory for the bucket array.

We're (still) not done yet!

Consider this second test script:

Ruby
  1. GC.disable
  2. require 'rubygems'
  3. gem 'rails'
  4. require 'set'
  5. require 'pathname'
  6. require 'cgi'
  7. require 'initializer'
  8.  
  9. GC.enable
  10. ObjectSpace.garbage_collect
  11. GC.disable
  12. pid = fork do
  13.         pid = fork do
  14.                 puts "forked"
  15.                 STDIN.readline
  16.  
  17.                 GC.enable
  18.                 if ObjectSpace.respond_to?(:garbage_collect!)
  19.                         ObjectSpace.garbage_collect!
  20.                 else
  21.                         ObjectSpace.garbage_collect
  22.                 end
  23.                 puts "garbage collected"
  24.                 STDIN.readline
  25.                 exit!
  26.         end
  27.         Process.waitpid(pid)
  28.         exit!
  29. end
  30. Process.waitpid(pid)

It loads some code, then forks and runs garbage collection. When run in standard Ruby, then the final writable memory is 4.7 MB. When run in my patched Ruby, the final writable memory is 4.1 MB. While this is an improvement, it's not much. Most of the memory occupied by the first test script is in strings. In this second test script, most of the memory is occupied by code (i.e. the AST). The garbage collector seems to write to AST nodes even when it's not necessary.

I also tested mongrel_light_cluster. After mongrel_light_cluster has forked, each child has about 15 MB of writable memory. After loading some pages, their writable memory jump to 17 MB. Compared to the 26 MB used by regular mongrel_cluster, this certainly is an improvement. But I'm still not satisfied. I'm pretty sure this can be optimized further. Somewhere, somehow, pages are being marked as dirty. Finding them (and finding a solution for them) will be hard.

Comments (9)

Making Ruby’s garbage collector copy-on-write friendly, part 2

The marking table now uses a set instead of a full hash table. This set implementation is based on st.c from the Ruby source distribution. I removed two fields from the hash table entry structure, so my set structure uses 8 bytes per entry instead of 16 bytes per entries (excluding malloc overhead). If you’re really curious, I changed

C
  1. struct st_table_entry {
  2.     unsigned int hash;
  3.     st_data_t key;
  4.     st_data_t record;
  5.     st_table_entry *next;
  6. };

to

C
  1. struct _PointerSetEntry {
  2.         PointerSetElement element;
  3.         PointerSetEntry *next;
  4. };

To my surprise, this didn’t save as much memory as I expected. The memory saving was only 500 KB instead of the expected 3.8 MB. :(

I later found out that a significant portion of memory is used by the bucket array in the set’s structure. So I increased the set’s default load factor from 0.75 to 2.0. This reduced memory usage in my test program from 16.5 MB to 13.5 MB! Despite collisions, the effect on performance is minimal. Increasing the load factor further doesn’t seem to save memory.

For my next attempt, I will be using a memory pool for allocating the set entries. This should reduce memory fragmentation, as well as increasing performance because allocations and frees and be done in constant time. Hopefully it will also have less memory overhead than malloc.

Comments (1)

Making Ruby’s garbage collector copy-on-write friendly

Remember that I said that I give up? Well, I lied. ;) Something is driving me. I guess I loathe the idea of having to pay an extra $20 per month for the extra RAM to run more Rails apps.

Sleep has to be the best thing ever invented. After a night of sleep and some careful thinking, I managed to modify Ruby’s garbage collector to use a marking table instead of marking the objects directly. The patch seems to work: ‘make test’ and ‘make install’ both run fine (both run a non-trivial Ruby script). Rubygems, which is notorious for using lots of memory and CPU, runs fine as well. Ruby on Rails also seems to work fine.

The Ruby 1.8.6 patch is here.

How it works

Every Ruby object is represented by a C struct, and always begins with the following field:

unsigned long flags;

Ruby uses mark and sweep garbage collection. First it goes through all objects and marks every object which is in use. This is done by setting the FL_MARK flag on the aforementioned field. Then it goes through all objects again, and frees every object which is not marked. The problem was that this breaks copy-on-write semantics: it effectively marks nearly all pages as dirty. Using a mark table, which keeps the marking information in a separate memory region, is supposed to solve that problem, as Dough Beaver mentioned in this slide. My patch modifies the garbage collector to do just that.

We’re not done yet

Unfortunately things aren’t that simple (are they ever simple? :( ). I notice that the memory gains aren’t as good as I hoped. So I wrote the following test script:

Ruby
  1. list = []
  2. 500000.times do
  3.         list < < "foo"
  4. end
  5.  
  6. pid = fork do
  7.         puts "forked"
  8.         STDIN.readline
  9.  
  10.         ObjectSpace.garbage_collect
  11.         puts "garbage collected"
  12.         STDIN.readline
  13.         exit!
  14. end
  15. Process.waitpid(pid)

I use the GNOME System Monitor to monitor the processes' memory usage. The 'Writable memory' column is especially interesting, as that column tells me how much memory is really private to that process (instead of being shared with the parent/child process). Immediate after forking, the child process's VM size is 29.1 MB, and the writable memory is 76 KB. After running the garbage collection though, the writable memory jumps to 16.9 MB.

Something is definitely wrong here! I know that nothing will be freed - all strings that I put in the list array should never be freed as they are still reachable.

After intensive debugging, I found out that the marking table is using quite some memory. The marking table uses the hash table implementation as provided by Ruby's st.c: st_table. st_table uses malloc() to allocate entries for the hash table. I noticed that each entry uses 24 bytes (this size is including malloc overhead). I have more than 500,000 strings in the list, so the marking table should contain at least 500,000 entries after the marking phase. 500,000 * 24 = 12,000,000. Uh oh, the marking table uses 12 MB of memory during the marking phase. :(

So the main issue now is to find a way to reduce the marking table's memory usage. I could port Google's sparse_hash to C. sparse_hash is a bit slower than a regular hash table, but has only 2 bits of overhead per entry. This would theoretically reduce memory usage from 12 MB to (32 * 500,000 + 2 * 500,000) / 8 bytes = 2 MB.

Does anyone have any better ideas?

Comments (7)

« Previous entries Next Page » Next Page »