Hunting down obscure GC bugs

It has been approximately a year since my copy-on-write work on the Ruby garbage collector was released in a form that’s suitable for production environments, namely in the form of Ruby Enterprise Edition. In the past half year its usage has grown dramatically, being adopted by, among others, New York Times, Shopify and 37signals.

However, there are a few reports of REE freezing after a while, and it seems to occur randomly. A few people have posted backtraces of the affected Ruby processes, and the backtrace always shows the function finalize_list. Sometimes the backtrace shows that the topmost function in the call stack is st_delete, and sometimes it does not.

Is there anything more frustrating to a programmer who wants to help, than not being able to reproduce the problem? I could not reproduce it at all, not on my development machine and not on any of our servers. The problem seems to be rare because otherwise a lot more people should have reported this. There’s only one way to combat this problem: the hard way, by performing a thorough code inspection. This is not an easy task because the garbage collector source file, gc.c, consists of 2945 lines. There are about 10 lines of non-API comments, of which 7 are written by myself. There are many idoms in the Ruby code base that are hard to read.

Let’s start at st_delete, located at st.c. st.c is an implementation of a hash table, used all over the place in the Ruby codebase, for example for storing instance variables in a class. The st_delete function is for – can you guess it – deleting an entry from a hash table. Can this function loop infinitely?

C
  1. int
  2. st_delete(table, key, value)
  3.     register st_table *table;
  4.     register st_data_t *key;
  5.     st_data_t *value;
  6. {
  7.     unsigned int hash_val;
  8.     st_table_entry *tmp;
  9.     register st_table_entry *ptr;
  10.  
  11.     hash_val = do_hash_bin(*key, table);
  12.     ptr = table->bins[hash_val];
  13.  
  14.     if (ptr == 0) {
  15.         if (value != 0) *value = 0;
  16.         return 0;
  17.     }
  18.  
  19.     if (EQUAL(table, *key, ptr->key)) {
  20.         table->bins[hash_val] = ptr->next;
  21.         table->num_entries–;
  22.         if (value != 0) *value = ptr->record;
  23.         *key = ptr->key;
  24.         free(ptr);
  25.         return 1;
  26.     }
  27.  
  28.     for(; ptr->next != 0; ptr = ptr->next) {
  29.         if (EQUAL(table, ptr->next->key, *key)) {
  30.             tmp = ptr->next;
  31.             ptr->next = ptr->next->next;
  32.             table->num_entries–;
  33.             if (value != 0) *value = tmp->record;
  34.             *key = tmp->key;
  35.             free(tmp);
  36.             return 1;
  37.         }
  38.     }
  39.  
  40.     return 0;
  41. }

There is at least one infinite looping candidate in this function: the for loop in the bottom part of the function. The only way in which the ‘for’ loop can be infinite is if the linked list that the hash bucket refers to has a circular reference.

st_delete is called by run_final in gc.c. run_final runs all finalizers on a given object. Finalizers are pieces of codes that can be attached to an object. They are run some time after the garbage collector has freed an object. For example, socket objects have finalizers that close the underlying socket connection.

The run_final function looks like this:

C
  1. static void
  2. run_final(obj)
  3.     VALUE obj;
  4. {
  5.     long i;
  6.     int status, critical_save = rb_thread_critical;
  7.     VALUE args[3], table, objid;
  8.  
  9.     objid = rb_obj_id(obj);     /* make obj into id */
  10.     rb_thread_critical = Qtrue;
  11.     args[1] = 0;
  12.     args[2] = (VALUE)ruby_safe_level;
  13.     for (i=0; i<RARRAY(finalizers)->len; i++) {
  14.         args[0] = RARRAY(finalizers)->ptr[i];
  15.         if (!args[1]) args[1] = rb_ary_new3(1, objid);
  16.         rb_protect((VALUE(*)_((VALUE)))run_single_final, (VALUE)args, &status);
  17.     }
  18.     if (finalizer_table && st_delete(finalizer_table, (st_data_t*)&obj, &table)) {
  19.         for (i=0; i<RARRAY(table)->len; i++) {
  20.             VALUE final = RARRAY(table)->ptr[i];
  21.             args[0] = RARRAY(final)->ptr[1];
  22.             if (!args[1]) args[1] = rb_ary_new3(1, objid);
  23.             args[2] = FIX2INT(RARRAY(final)->ptr[0]);
  24.             rb_protect((VALUE(*)_((VALUE)))run_single_final, (VALUE)args, &status);
  25.         }
  26.     }
  27.     rb_thread_critical = critical_save;
  28. }

There are two loops in this function, and both will always terminate because RARRAY(...)->len is bounded.

run_final is called by finalize_list. This function loops through a list of objects and calls run_final on each object:

C
  1. static void
  2. finalize_list(p)
  3.     RVALUE *p;
  4. {
  5.     while (p) {
  6.         RVALUE *tmp = p->as.free.next;
  7.         run_final((VALUE)p);
  8.         /* Don’t free objects that are singletons, or objects that are already freed.
  9.          * The latter is to prevent the unnecessary marking of memory pages as dirty,
  10.          * which can destroy copy-on-write semantics.
  11.          */
  12.         if (!FL_TEST(p, FL_SINGLETON) && p->as.free.flags != 0) {
  13.             p->as.free.flags = 0;
  14.             p->as.free.next = freelist;
  15.             rb_mark_table_remove(p);
  16.             freelist = p;
  17.         }
  18.         p = tmp;
  19.     }
  20. }

This function can loop infinitely if the linked list (p) has a circular reference.

The list of possible culprits that I’ve compiled so far is not exhaustive. Although unlikely, *any* function called by one of these functions could contain an infinite loop. It’s almost impossible to review all possible code paths. Indeed, I’ve struggled with this problem for months, unable to find out the cause, or even verify that it’s indeed REE’s fault and not the fault of some native extension.

Until last week, that is. I’ve been working with Joshua Sierles, system administrator at 37signals, to solve this problem. Once in a while one of their REE processes would freeze in an infinite loop, called from finalize_list. One of the many things that we had tried is to inspect such stuck processes with gdb, but their REE binaries didn’t have debugging symbols. So after reinstalling REE with debugging symbols and after some debugging sessions, we were able to infer that the infinite loop occurs in finalize_list: the linked list contains a circular reference!

But how can such a circular reference occur? This infinite looping bug doesn’t occur in normal Ruby, so it’s probably an REE-specific bug. The fact that the infinite loop occurs in finalize_list probably means that I have a bug somewhere in the finalizers handling code, so I decided to inspect all finalizers-related code that I’ve modified. After spending a lot of time digging through the source code, I could not find any obvious places that could create circular references in the finalizers list, but I did remember an interesting change that I made in the past. Consider the following code in gc_mark_all() in the original Ruby source code:

C
  1. while (p < pend) {
  2.     if ((p->as.basic.flags & FL_MARK) &&
  3.         (p->as.basic.flags != FL_MARK)) {
  4.         gc_mark_children((VALUE)p, 0);
  5.     }
  6.     p++;
  7. }

Standard Ruby uses the FL_MARK flag in the ‘flags’ member to indicate that an object is marked by the garbage collector. This makes the above ‘if’ statement especially interesting: it checks whether FL_MARK is set, but it also checks whether FL_MARK isn’t the only flag in the ‘flags’ field. I didn’t fully understand what this means, but I did understand that because of this I can’t just get rid of the FL_MARK flag, and replace FL_MARK bitmask operations with calls to the mark table. This code here tells me that the codebase assumes that if an object is marked, then its flags is not 0 (a flag of 0 means that the object is a free slot). If I naively replace FL_MARK operations with mark table calls then this assumption will be violated. So I introduced the FL_ALLOCATED flag. Every object that is not free will have this flag set. Then I changed the ‘if’ statement to:

C
  1. if (rb_mark_table_heap_contains(heap, p) &&
  2.     (p->as.basic.flags != FL_ALLOCATED)) {
  3.     gc_mark_children((VALUE)p, 0);
  4. }

This should have the same semantics as the original, or so I thought. After all it checks whether the object is marked, but whether it’s not the case that there are no other flags set.

Consider also this code in gc_sweep(), the function that frees all non-marked objects:

C
  1. while (p < pend) {
  2.     if (!(p->as.basic.flags & FL_MARK)) {
  3.         if (p->as.basic.flags) {
  4.             obj_free((VALUE)p);
  5.         }
  6.         if (need_call_final && FL_TEST(p, FL_FINALIZE)) {
  7.             p->as.free.flags = FL_MARK; /* remain marked */
  8.             p->as.free.next = final_list;
  9.             final_list = p;
  10.         }
  11.         else {
  12.             p->as.free.flags = 0;
  13.             p->as.free.next = freelist;
  14.             freelist = p;
  15.         }
  16.         n++;
  17.     }
  18.     else if (RBASIC(p)->flags == FL_MARK) {
  19.         /* objects to be finalized */
  20.         /* do nothing remain marked */
  21.     }
  22.     else {
  23.         RBASIC(p)->flags &= ~FL_MARK;
  24.         live++;
  25.     }
  26.     p++;
  27. }

This piece of code iterates through a Ruby heap, and frees the objects that are not marked.
if (need_call_final && FL_TEST(p, FL_FINALIZE)) { checks whether the current object has any finalizers.
The line else if (RBASIC(p)->flags == FL_MARK) { is similar to the check in gc_mark_all(). I replaced this line with else if (RBASIC(p)->flags == FL_ALLOCATED) {
I also changed the line p->as.free.flags = FL_MARK; /* remain marked */ to rb_mark_table_heap_add(heap, p); /* remain marked */

Because of the recent code inspection, I discovered that the latter change is not correct. The original code not only keeps the object marked, but it also removes all other flags. I wondered whether this may have anything to do with the infinite loop bug. So I inserted the following code before the rb_mark_table_heap_add call:

C
  1. p->as.free.flags = FL_ALLOCATED;

The semantics should now be equal to the original’s… right?

It is clear that if I do not insert this call, then the object’s existing flags will be preserved. p->as.basic.flags != FL_MARK in gc_mark_all() will then always be true and RBASIC(p)->flags == FL_MARK will always be false. This is clearly undesirable.

After some testing, it would seem that the change doesn’t break anything. The resulting Ruby interpreter was able to install the Rails gem and generate the rdoc for it, a fairly memory-intensive operation. However, I wasn’t sure whether this also tests the finalizers handling code. Then the thought that finalizers could be plain Ruby code and thus could trigger a garbage collection occurred to me. If I trigger a garbage collection inside a finalizer, will things break?

Ruby
  1. def foo
  2.         x = "hello world"
  3.         ObjectSpace.define_finalizer(x, lambda { |x| GC.start })
  4.         x = nil
  5. end
  6.  
  7. def bar
  8.         x = 1 + 1
  9.         y = x * 2
  10.         ["", "", ""]
  11.         return y * 3
  12. end
  13.  
  14. 5000.times do |i|
  15.         foo
  16. end
  17. puts "done"

Yes. Yes it will, as was clear after executing the above test script that I wrote. It would cause Ruby to abort with a message along the lines of “rb_gc_mark(): unknown data type 0x0(0xYYYY) corrupted object”. This is certainly very suspicious. Does this mean that my change w.r.t. FL_ALLOCATED is incorrect, and could be the cause of the infinite loop bug?

I spent several hours debugging this with gdb, but at the end of the day I was still as clueless as several hours ago. I knew that something went wrong, but I didn’t know why. But that was probably because I was tired. The next morning I found the solution. I spent an hour thinking about what the original code is actually trying to do, something which I didn’t fully understand before.

  1. I figured that the == FL_MARK, != FL_MARK and = FL_MARK code must have something to do with finalizers.
  2. The code around = FL_MARK is trying to keep the object alive so that it could be finalized later. This makes sense. In addition, it is apparently also a guard that prevents any garbage collection runs invoked by any of the finalizers from freeing this object. It does so by making sure that RBASIC(p)->flag == FL_MARK returns true. This prevents gc_sweep from removing the object’s marking in the ‘else’ block.
  3. In a standard mark-and-sweep implementation, all objects will be unmarked after a garbage collection run. The previous findings imply that this is not the case in Ruby’s implementation! My copy-on-write friendly mark table clears all markings after a garbage collection run, which is apparently not correct.

So setting an object’s flag to FL_MARK actually means that the object is to be finalized later, and should not be freed until its finalizer has run. == FL_MARK actually checks whether an object is marked to be finalized later, and != FL_MARK checks the negation.

With this in mind, I made the following changes: http://github.com/FooBarWidget/rubyenterpriseedition/commit/63247bf87b11c9acc2dfd1ed5f98b1ed8dd4c780
I removed the FL_ALLOCATED flag and introduced an FL_DEFER_FINALIZE flag. I replaced the FL_MARK bitmask checks with FL_DEFER_FINALIZE checks. This change has been tested in production for a while now by 37signals. Everything’s been running smoothly ever since.

I’d really like to thank Joshua for his patience. Thanks to the debugging sessions that he made possible, I was able to finally track down the cause of the infinite loop bug.

We’re going to release a new version of Ruby Enterprise Edition soon, which includes this bug fix.

It’s amazing sometimes how so much thinking work only results in 10 lines of code. If you like my work, please consider sending a donation and/or to recommend me on Working With Rails. Thank you. :)

5 Comments »

  1. Ben Reubenstein said,

    April 12, 2009 @ 3:33 pm

    Great write up. I can appreciate how frustrating tracking down a bug like that can be.

  2. Mischa said,

    April 25, 2009 @ 9:16 pm

    Great story, thanks for writing up!

  3. Mike said,

    November 4, 2009 @ 2:20 pm

    Thank you very much!
    We use ruby-enterprise-1.8.6-20090201 with passenger-2.0.6 and experienced that problem with image uploads using fleximage. We’re going to upgrade now.

  4. Sebastian said,

    January 15, 2010 @ 11:44 am

    Thanks for your work.

  5. Laura said,

    March 4, 2011 @ 4:46 am

    Cool I have been looking for some help to get rid of the GC Bugs in my project

RSS feed for comments on this post · TrackBack URI

Leave a Comment