Making IRC channels friendlier with ZNC

It has been a while since I’ve joined the #passenger channel on irc.freenode.net, but I notice that people regularly join that channel for support. It would be a shame if a user joins the channel, asks a question, then leaves frustration because nobody responds. However I cannot watch IRC all day. How do we solve this problem?

znc

Luckily there are programs that can aid me in this. Luuk uses a so-called “IRC bouncer” to stay online even when his client is disconnected. The bouncer runs on a Unix server, and the client connects to the bouncer instead of the IRC server. When he connects his client the bouncer will replay messages that he missed.

There are many IRC bouncers out there. After some research it would appear that ZNC is the most actively maintained one. It’s written in C++ and is extensible through C++, Python and Perl.

Armed with ZNC, I’ve setup the following things:

  • I’ve configured ZNC to automatically set me away on IRC when my last client disconnects. My away message tells the user that my timezone is in Europe, and kindly refers him to the Phusion Passenger support page in case I don’t respond quickly enough.
  • I’ve also setup my IRC client to display a notification whenever my nick is mentioned in the channel, in case I am behind (or nearby) the computer but not watching the IRC window.
  • Finally, I’ve setup a channel topic that welcomes users, and tells them to mention my nick in their message to nudge me.

So far so good, but this alone isn’t enough. Not all IRC clients display away users in a special way, so users may not be aware that I am away. Also they may not read the channel topic when they enter the channel. ZNC doesn’t have any built-in functionality that can aid me further, so that’s when I decided to write some ZNC modules. Their API looks fairly easy.

Octocat

The result is on Github. I’ve written two modules:

  • channel_away.cpp: if I’m away, then whenever someone sends a channel message that starts with my nick, ZNC will send an autoreply.
  • welcome_to_passenger.cpp: whenever someone joins #passenger, ZNC will send that person a welcome message, encouraging that person to ask a question and explaining to that person that I may be away, so please visit the support website if nobody answers.

I hope that #passenger has become a much friendlier place now and that my ZNC modules are useful to other people. :D

Comments

Blog restored

My blog has been down for a while because of a DNS problem. It’s finally fixed. :)

Comments

3 Node.js gotchas that I’ve found in the past 3 days

I’ve been writing a NoSQL database server in Node.js. Now before anybody lynch me for mentioning 2 buzzwords or for the fact that I’m writing a database server, please know that I have perfectly valid and informed reasons for doing this.

Ignoring the oddity that is Javascript, one of the earliest issues that I faced was the fact that there seemed to be no way to be notified of when a fs.createWriteStream(...).write(...) completes. WriteStream has a ‘drain’ event but I’m not interested in when the entire write queue is drained, I’m only interested in when a specific write has been drained. After reading the Node.js code it turned out that WriteStream.write() does accept a callback for completion notification, it just isn’t documented.

Another issue was that EPIPE errors on the socket would cause Node.js to throw an exception. I was initially confused because I thought such errors would be emitted through the ‘error’ event, which I haven’t bound on yet. In a mailing list post Ryan Dahl said that errors are thrown as exceptions except when the ‘error’ event is bound, which he admits is a rather dubious behavior. Again this isn’t documented.

I also found out that Node.js would close the connection whenever it encounters a socket error. Any socket error. This is also rather dubious behavior because it would make it impossible to handle certain edge cases involving EPIPE. Suppose that the server is sending some data to the client. The client hasn’t acknowledged that yet, but sends a bunch of data to the server and closes the connection. The server will now encounter an EPIPE error upon writing, making it impossible to read the remaining data that the client has sent. Granted, such a protocol is rather dubious by itself but I do think Node.js should make it at least possible to handle this situation.

Those who want to develop in Node.js have been warned.

Comments (6)

Phusion Passenger for Python?

Today I came across several postings on various websites about uWSGI, a WSGI server and Apache module written in C. As WSGI bears a lot of similarities to Rack – after all, Rack was based on WSGI – this caught my interest. I think it’s safe to at least say that – before Phusion Passenger came to the scene – Python web app deployment was less of a pain than Ruby web app deployment, thanks to mod_wsgi as well as possibly other Python hosting modules or application servers that I might not know of. Both mod_wsgi and uWSGI differ significantly from Phusion Passenger when it comes to design choices, though the common goal is web app deployment. The fact that mod_wsgi and uWSGI are developed independently from Phusion Passenger makes them all the more interesting. It’s refreshing to see how other people try to tackle the same problem.

What is the state of Python web app deployment today? Is it on par with Ruby? I honestly don’t know because although I’ve written plenty of Python code (mostly desktop apps and server scripts), I’ve never written a Python web app before beyond hello world, nor deployed one. Which is why this comment on Hacker News caught my attention:

Much needed. We pythonistas are still waiting for a deployment option as slick as ruby passenger.

So there is at least one person in the world who thinks that Phusion Passenger is slick and apparently better than anything that’s current available in Python land. Why, I’m flattered. :) However, I don’t know whether this is just the opinion of one man or whether there’s a general consensus in the Python community that deployment can and should be better.

What is better? As one of the authors of Phusion Passenger I define “better” as “easier”, “smoother”, “less painful”, which should also reveal the philosophy behind many of Phusion Passenger’s design choices. Not everybody agrees with Phusion Passenger’s design choices though. Graham Dumpleton, author of mod_wsgi, thinks… well let’s keep it at saying that he doesn’t think Phusion Passenger is a work of art. Needless to say I do not agree with his opinion nor his reasoning behind it. But Graham probably had other design choices and design philosophies than Phusion Passenger so he’s entitled to his opinions.

If there are any Pythonistas out there, please share your thoughts! Do you think there would be any demand for a hypothetical “Phusion Passenger for Python”? If not, what do you think about the current Python web app deployment situation? Do you think there’s room for improvement?

Comments (14)

New way to hide the banner

For those who want to be able to pass URLs of my blog to co-workers without showing them the banner: it is now possible. If you add “?hide_banner=yes” to the URL then the banner will be hidden automatically, without needing them to click on the “Censor banner” link. For example:

http://izumi.plan99.net/blog/?hide_banner=yes

This will also automatically set a cookie so that subsequent page loads won’t show the banner by default.

Comments (3)

Hunting down obscure GC bugs

It has been approximately a year since my copy-on-write work on the Ruby garbage collector was released in a form that’s suitable for production environments, namely in the form of Ruby Enterprise Edition. In the past half year its usage has grown dramatically, being adopted by, among others, New York Times, Shopify and 37signals.

However, there are a few reports of REE freezing after a while, and it seems to occur randomly. A few people have posted backtraces of the affected Ruby processes, and the backtrace always shows the function finalize_list. Sometimes the backtrace shows that the topmost function in the call stack is st_delete, and sometimes it does not.

Is there anything more frustrating to a programmer who wants to help, than not being able to reproduce the problem? I could not reproduce it at all, not on my development machine and not on any of our servers. The problem seems to be rare because otherwise a lot more people should have reported this. There’s only one way to combat this problem: the hard way, by performing a thorough code inspection. This is not an easy task because the garbage collector source file, gc.c, consists of 2945 lines. There are about 10 lines of non-API comments, of which 7 are written by myself. There are many idoms in the Ruby code base that are hard to read.

Let’s start at st_delete, located at st.c. st.c is an implementation of a hash table, used all over the place in the Ruby codebase, for example for storing instance variables in a class. The st_delete function is for – can you guess it – deleting an entry from a hash table. Can this function loop infinitely?

C
  1. int
  2. st_delete(table, key, value)
  3.     register st_table *table;
  4.     register st_data_t *key;
  5.     st_data_t *value;
  6. {
  7.     unsigned int hash_val;
  8.     st_table_entry *tmp;
  9.     register st_table_entry *ptr;
  10.  
  11.     hash_val = do_hash_bin(*key, table);
  12.     ptr = table->bins[hash_val];
  13.  
  14.     if (ptr == 0) {
  15.         if (value != 0) *value = 0;
  16.         return 0;
  17.     }
  18.  
  19.     if (EQUAL(table, *key, ptr->key)) {
  20.         table->bins[hash_val] = ptr->next;
  21.         table->num_entries–;
  22.         if (value != 0) *value = ptr->record;
  23.         *key = ptr->key;
  24.         free(ptr);
  25.         return 1;
  26.     }
  27.  
  28.     for(; ptr->next != 0; ptr = ptr->next) {
  29.         if (EQUAL(table, ptr->next->key, *key)) {
  30.             tmp = ptr->next;
  31.             ptr->next = ptr->next->next;
  32.             table->num_entries–;
  33.             if (value != 0) *value = tmp->record;
  34.             *key = tmp->key;
  35.             free(tmp);
  36.             return 1;
  37.         }
  38.     }
  39.  
  40.     return 0;
  41. }

There is at least one infinite looping candidate in this function: the for loop in the bottom part of the function. The only way in which the ‘for’ loop can be infinite is if the linked list that the hash bucket refers to has a circular reference.

st_delete is called by run_final in gc.c. run_final runs all finalizers on a given object. Finalizers are pieces of codes that can be attached to an object. They are run some time after the garbage collector has freed an object. For example, socket objects have finalizers that close the underlying socket connection.

The run_final function looks like this:

C
  1. static void
  2. run_final(obj)
  3.     VALUE obj;
  4. {
  5.     long i;
  6.     int status, critical_save = rb_thread_critical;
  7.     VALUE args[3], table, objid;
  8.  
  9.     objid = rb_obj_id(obj);     /* make obj into id */
  10.     rb_thread_critical = Qtrue;
  11.     args[1] = 0;
  12.     args[2] = (VALUE)ruby_safe_level;
  13.     for (i=0; i<RARRAY(finalizers)->len; i++) {
  14.         args[0] = RARRAY(finalizers)->ptr[i];
  15.         if (!args[1]) args[1] = rb_ary_new3(1, objid);
  16.         rb_protect((VALUE(*)_((VALUE)))run_single_final, (VALUE)args, &status);
  17.     }
  18.     if (finalizer_table && st_delete(finalizer_table, (st_data_t*)&obj, &table)) {
  19.         for (i=0; i<RARRAY(table)->len; i++) {
  20.             VALUE final = RARRAY(table)->ptr[i];
  21.             args[0] = RARRAY(final)->ptr[1];
  22.             if (!args[1]) args[1] = rb_ary_new3(1, objid);
  23.             args[2] = FIX2INT(RARRAY(final)->ptr[0]);
  24.             rb_protect((VALUE(*)_((VALUE)))run_single_final, (VALUE)args, &status);
  25.         }
  26.     }
  27.     rb_thread_critical = critical_save;
  28. }

There are two loops in this function, and both will always terminate because RARRAY(...)->len is bounded.

run_final is called by finalize_list. This function loops through a list of objects and calls run_final on each object:

C
  1. static void
  2. finalize_list(p)
  3.     RVALUE *p;
  4. {
  5.     while (p) {
  6.         RVALUE *tmp = p->as.free.next;
  7.         run_final((VALUE)p);
  8.         /* Don’t free objects that are singletons, or objects that are already freed.
  9.          * The latter is to prevent the unnecessary marking of memory pages as dirty,
  10.          * which can destroy copy-on-write semantics.
  11.          */
  12.         if (!FL_TEST(p, FL_SINGLETON) && p->as.free.flags != 0) {
  13.             p->as.free.flags = 0;
  14.             p->as.free.next = freelist;
  15.             rb_mark_table_remove(p);
  16.             freelist = p;
  17.         }
  18.         p = tmp;
  19.     }
  20. }

This function can loop infinitely if the linked list (p) has a circular reference.

The list of possible culprits that I’ve compiled so far is not exhaustive. Although unlikely, *any* function called by one of these functions could contain an infinite loop. It’s almost impossible to review all possible code paths. Indeed, I’ve struggled with this problem for months, unable to find out the cause, or even verify that it’s indeed REE’s fault and not the fault of some native extension.

Until last week, that is. I’ve been working with Joshua Sierles, system administrator at 37signals, to solve this problem. Once in a while one of their REE processes would freeze in an infinite loop, called from finalize_list. One of the many things that we had tried is to inspect such stuck processes with gdb, but their REE binaries didn’t have debugging symbols. So after reinstalling REE with debugging symbols and after some debugging sessions, we were able to infer that the infinite loop occurs in finalize_list: the linked list contains a circular reference!

But how can such a circular reference occur? This infinite looping bug doesn’t occur in normal Ruby, so it’s probably an REE-specific bug. The fact that the infinite loop occurs in finalize_list probably means that I have a bug somewhere in the finalizers handling code, so I decided to inspect all finalizers-related code that I’ve modified. After spending a lot of time digging through the source code, I could not find any obvious places that could create circular references in the finalizers list, but I did remember an interesting change that I made in the past. Consider the following code in gc_mark_all() in the original Ruby source code:

C
  1. while (p < pend) {
  2.     if ((p->as.basic.flags & FL_MARK) &&
  3.         (p->as.basic.flags != FL_MARK)) {
  4.         gc_mark_children((VALUE)p, 0);
  5.     }
  6.     p++;
  7. }

Standard Ruby uses the FL_MARK flag in the ‘flags’ member to indicate that an object is marked by the garbage collector. This makes the above ‘if’ statement especially interesting: it checks whether FL_MARK is set, but it also checks whether FL_MARK isn’t the only flag in the ‘flags’ field. I didn’t fully understand what this means, but I did understand that because of this I can’t just get rid of the FL_MARK flag, and replace FL_MARK bitmask operations with calls to the mark table. This code here tells me that the codebase assumes that if an object is marked, then its flags is not 0 (a flag of 0 means that the object is a free slot). If I naively replace FL_MARK operations with mark table calls then this assumption will be violated. So I introduced the FL_ALLOCATED flag. Every object that is not free will have this flag set. Then I changed the ‘if’ statement to:

C
  1. if (rb_mark_table_heap_contains(heap, p) &&
  2.     (p->as.basic.flags != FL_ALLOCATED)) {
  3.     gc_mark_children((VALUE)p, 0);
  4. }

This should have the same semantics as the original, or so I thought. After all it checks whether the object is marked, but whether it’s not the case that there are no other flags set.

Consider also this code in gc_sweep(), the function that frees all non-marked objects:

C
  1. while (p < pend) {
  2.     if (!(p->as.basic.flags & FL_MARK)) {
  3.         if (p->as.basic.flags) {
  4.             obj_free((VALUE)p);
  5.         }
  6.         if (need_call_final && FL_TEST(p, FL_FINALIZE)) {
  7.             p->as.free.flags = FL_MARK; /* remain marked */
  8.             p->as.free.next = final_list;
  9.             final_list = p;
  10.         }
  11.         else {
  12.             p->as.free.flags = 0;
  13.             p->as.free.next = freelist;
  14.             freelist = p;
  15.         }
  16.         n++;
  17.     }
  18.     else if (RBASIC(p)->flags == FL_MARK) {
  19.         /* objects to be finalized */
  20.         /* do nothing remain marked */
  21.     }
  22.     else {
  23.         RBASIC(p)->flags &= ~FL_MARK;
  24.         live++;
  25.     }
  26.     p++;
  27. }

This piece of code iterates through a Ruby heap, and frees the objects that are not marked.
if (need_call_final && FL_TEST(p, FL_FINALIZE)) { checks whether the current object has any finalizers.
The line else if (RBASIC(p)->flags == FL_MARK) { is similar to the check in gc_mark_all(). I replaced this line with else if (RBASIC(p)->flags == FL_ALLOCATED) {
I also changed the line p->as.free.flags = FL_MARK; /* remain marked */ to rb_mark_table_heap_add(heap, p); /* remain marked */

Because of the recent code inspection, I discovered that the latter change is not correct. The original code not only keeps the object marked, but it also removes all other flags. I wondered whether this may have anything to do with the infinite loop bug. So I inserted the following code before the rb_mark_table_heap_add call:

C
  1. p->as.free.flags = FL_ALLOCATED;

The semantics should now be equal to the original’s… right?

It is clear that if I do not insert this call, then the object’s existing flags will be preserved. p->as.basic.flags != FL_MARK in gc_mark_all() will then always be true and RBASIC(p)->flags == FL_MARK will always be false. This is clearly undesirable.

After some testing, it would seem that the change doesn’t break anything. The resulting Ruby interpreter was able to install the Rails gem and generate the rdoc for it, a fairly memory-intensive operation. However, I wasn’t sure whether this also tests the finalizers handling code. Then the thought that finalizers could be plain Ruby code and thus could trigger a garbage collection occurred to me. If I trigger a garbage collection inside a finalizer, will things break?

Ruby
  1. def foo
  2.         x = "hello world"
  3.         ObjectSpace.define_finalizer(x, lambda { |x| GC.start })
  4.         x = nil
  5. end
  6.  
  7. def bar
  8.         x = 1 + 1
  9.         y = x * 2
  10.         ["", "", ""]
  11.         return y * 3
  12. end
  13.  
  14. 5000.times do |i|
  15.         foo
  16. end
  17. puts "done"

Yes. Yes it will, as was clear after executing the above test script that I wrote. It would cause Ruby to abort with a message along the lines of “rb_gc_mark(): unknown data type 0×0(0xYYYY) corrupted object”. This is certainly very suspicious. Does this mean that my change w.r.t. FL_ALLOCATED is incorrect, and could be the cause of the infinite loop bug?

I spent several hours debugging this with gdb, but at the end of the day I was still as clueless as several hours ago. I knew that something went wrong, but I didn’t know why. But that was probably because I was tired. The next morning I found the solution. I spent an hour thinking about what the original code is actually trying to do, something which I didn’t fully understand before.

  1. I figured that the == FL_MARK, != FL_MARK and = FL_MARK code must have something to do with finalizers.
  2. The code around = FL_MARK is trying to keep the object alive so that it could be finalized later. This makes sense. In addition, it is apparently also a guard that prevents any garbage collection runs invoked by any of the finalizers from freeing this object. It does so by making sure that RBASIC(p)->flag == FL_MARK returns true. This prevents gc_sweep from removing the object’s marking in the ‘else’ block.
  3. In a standard mark-and-sweep implementation, all objects will be unmarked after a garbage collection run. The previous findings imply that this is not the case in Ruby’s implementation! My copy-on-write friendly mark table clears all markings after a garbage collection run, which is apparently not correct.

So setting an object’s flag to FL_MARK actually means that the object is to be finalized later, and should not be freed until its finalizer has run. == FL_MARK actually checks whether an object is marked to be finalized later, and != FL_MARK checks the negation.

With this in mind, I made the following changes: http://github.com/FooBarWidget/rubyenterpriseedition/commit/63247bf87b11c9acc2dfd1ed5f98b1ed8dd4c780
I removed the FL_ALLOCATED flag and introduced an FL_DEFER_FINALIZE flag. I replaced the FL_MARK bitmask checks with FL_DEFER_FINALIZE checks. This change has been tested in production for a while now by 37signals. Everything’s been running smoothly ever since.

I’d really like to thank Joshua for his patience. Thanks to the debugging sessions that he made possible, I was able to finally track down the cause of the infinite loop bug.

We’re going to release a new version of Ruby Enterprise Edition soon, which includes this bug fix.

It’s amazing sometimes how so much thinking work only results in 10 lines of code. If you like my work, please consider sending a donation and/or to recommend me on Working With Rails. Thank you. :)

Comments (5)

Phusion Passenger 2.1.2 (final) released

Please read the announcement for details.

Comments

Phusion Passenger 2.1.1 (beta) released

Please read the announcement for details.

Comments

Saturday night drawing session

Sketch: preview
(click for full version)

Drawing faces has got to be one of the most difficult exercises in life. Even being off by 1 millimeter can change a face significantly. Nevertheless, I’m pretty content with the result this time; it’s definitely better than my last attempt.

I spent about 1.5 hours on this sketch.

For comparison, here’s the original:
Original: preview
(click for full version)

Comments (4)

Phusion Passenger Pledgie Paypal page, now in English again

The Phusion Passenger community sponsorship campaign has progressed very quickly. It’s been a few days and we’re already at 27%! Thanks for your support!

However, many people have reported that the Pledgie Paypal page is in Dutch. We’ve contacted Pledgie, and it turns out to be a bug in Paypal. The nice guys at Pledgie have implemented a workaround for this problem, so everything should be in English again. Our apologies for the inconvenience!

Click here to lend your support to: Phusion Passenger first community sponsorship and make a donation at www.pledgie.com !

NOTE: Some people have reported that they’re unable to pay with American Express credit cards. We’re currently contacting Paypal about this problem.

Comments

« Previous entries Next Page » Next Page »