Optimizing RDoc

When it comes to documentation, there’s always room for improvement. I’ve been contributing to docrails for quite a while now. However, one issue that has always annoyed me is how slow RDoc is.

In particular, generating RDoc output for ActiveRecord is painfully slow. It takes 34.2 seconds on my machine. Using JRuby helps a little, seeing that it can do JIT and all, but not much: it still takes 29 seconds (including JVM startup time). I had hoped for a bigger performance improvement.

So a few days ago I started optimizing RDoc. And not without result. Here are the results of generating RDoc for the Ruby 1.8 standard library with RDoc running in MRI 1.8:
Before optimizing: 1624.8 seconds (27 minutes). I had to close all applications because RDoc ate almost 1.5 GB of memory.
After optimizing: 531.2 seconds (8 minutes and 51 seconds). RDoc now uses only about half the memory it used to need.
Performance improvement: 206%!

[Figure: RDoc generation time for the Ruby 1.8 standard library (rdoc_generation_time_for_ruby_1_8_standard_library.png)]

The results for generating RDoc for ActiveRecord are:

Before optimizing (MRI 1.8): 34.2 seconds
After optimizing (MRI 1.8): 29.9 seconds
Performance improvement: 14%

Before optimizing (MRI 1.9): 16.8 seconds
After optimizing (MRI 1.9): 13.2 seconds
Performance improvement: 27%

Before optimizing (JRuby): 29.0 seconds
After optimizing (JRuby): 24.5 seconds
Performance improvement: 18%

[Figure: RDoc generation time for ActiveRecord (rdoc_generation_time_for_activerecord.png)]

So what did I do? Read on!

Multi-threaded parsing

My laptop has a dual core CPU, and it’s kind of a waste to see RDoc utilizing only a single core. You could parallelize RDoc by running two rdoc processes at the same time, but I never generate two different sets of RDoc documentation simultaneously, so that wouldn’t do me any good. I had to find a way to parallelize a single rdoc process.

It turns out that RDoc’s code parsing phase is pretty easy to parallelize: RDoc creates a new parser object for every input file and parses the files sequentially. So I modified RDoc to create multiple worker threads during the parsing phase. The main thread offers a queue of filenames, and the worker threads each consume filenames from the queue and parse the corresponding files as fast as they can.
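
Here’s a minimal sketch of the scheme, not the actual patch: worker threads pull filenames from a shared queue and append parsed results to a shared array. The parse_one_file call is a stand-in for RDoc’s per-file parser.

    require 'thread'   # Queue and Mutex (needs an explicit require on Ruby 1.8)

    def parse_in_parallel(filenames, thread_count = 2)
      queue = Queue.new
      filenames.each { |name| queue << name }

      results = []
      mutex   = Mutex.new

      workers = Array.new(thread_count) do
        Thread.new do
          loop do
            begin
              name = queue.pop(true)        # non-blocking pop
            rescue ThreadError
              break                         # queue is empty, this worker is done
            end
            parsed = parse_one_file(name)   # stand-in for RDoc's per-file parser
            mutex.synchronize { results << parsed }
          end
        end
      end

      workers.each { |thread| thread.join }
      results
    end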

What I did not try to do, however, is make the parser itself multi-threaded. That’s just asking for problems. Those of you who are interested in multi-threaded programming, but aren’t experienced with it, should keep this in mind: keep your threading code as simple as possible, and make sure that your threads share as little data as possible. In my case, the only things that the threads share are the filename queue and the result aggregation array.

MRI 1.8 implements userspace threads, so having multiple threads doesn’t increase CPU core utilization. MRI 1.9 uses kernel threads, but it has a global interpreter lock, so it can’t utilize multiple CPU cores either. Luckily, JRuby doesn’t have a global interpreter lock. The parsing phase is now about 35% faster, when run on JRuby on my dual core machine.

This patch has been submitted upstream: http://rubyforge.org/tracker/index.php?func=detail&aid=22555&group_id=627&atid=2474

Cache template pages

For larger projects, the slowest phase of RDoc is probably the HTML generation phase. The latest version of RDoc uses ERB templates for HTML output, and generates an HTML file for every class, every module and every source file. However, after some profiling with ruby-prof, I found out that RDoc recompiles the ERB template for every output file, which is slow.

I modified RDoc so that it caches compiled ERB templates. This made it about 19% faster.
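
In essence the cache boils down to something like this (a sketch, not the actual patch): compile each distinct template once and reuse the compiled ERB object for every page rendered from it.

    require 'erb'

    # Compiled templates, keyed by their source text.
    COMPILED_TEMPLATES = {}

    # Compile a template only the first time that particular text is seen;
    # afterwards, return the cached ERB object.
    def compiled_template(template_text)
      COMPILED_TEMPLATES[template_text] ||= ERB.new(template_text)
    end

    # Rendering an output page then reuses the cached compiled template:
    #   html = compiled_template(page_template).result(binding)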

This patch has been submitted upstream: http://rubyforge.org/tracker/index.php?func=detail&aid=22556&group_id=627&atid=2474

Reduce the number of relative URL generations

For every output file, RDoc generates a list of links to all classes, files and methods, with URLs relative to that output file. This is done by calling the function index_to_links, which runs in O(n) time. For every output file, index_to_links is called 3 times: once with the full class list, once with the full file list and once with the full method list. Suppose there are K output files, L classes, N files and M methods. Then this results in K * (L+N+M) operations, and since K itself grows with L, N and M, the total work is quadratic in the size of the project. Ouch!

The RDoc authors already tried to optimize this by only calling index_to_links when the current output filename’s directory is different from the last one. For example, if the last output file was classes/active_support/cache/memcache_store.html and the current output file is classes/active_support/cache/memory_store.html, then index_to_links doesn’t need to be called again, and its last result can be reused. The problem is that the output filename list isn’t sorted, so a directory change is detected much more often than it needs to be.

I optimized this by sorting the list. This resulted in an 8% performance improvement.
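
The idea looks roughly like this; output_files, all_entries and emit_page are illustrative names, only index_to_links comes from RDoc itself:

    last_dir   = nil
    last_links = nil

    # With the list sorted, files in the same directory are processed
    # consecutively, so the cached index_to_links result can be reused
    # for an entire directory at a time.
    output_files.sort.each do |path|
      dir = File.dirname(path)
      if dir != last_dir
        last_links = index_to_links(path, all_entries)   # in RDoc this is three calls (classes, files, methods)
        last_dir   = dir
      end
      emit_page(path, last_links)
    end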

This patch has been submitted upstream: http://rubyforge.org/tracker/index.php?func=detail&aid=22565&group_id=627&atid=2474

Reduce garbage and redundant operations in index_to_links

But the story doesn’t end there. index_to_links accepts two arguments: a filename and an array. index_to_links sorts its input array every time it is called. Remember that I said that index_to_links is O(n)? It’s not: it’s actually O(n log(n)) because of the sorting.

The arrays that are passed to index_to_links remain the same every time, so it’s very inefficient to keep sorting them over and over. I optimized this by:

  1. sorting the arrays only once.
  2. passing the sorted arrays to index_to_links.
  3. modifying index_to_links to not perform any sorting.
  4. modifying index_to_links to use in-place array manipulation methods as much as possible, to avoid generating unnecessary garbage objects.

The result is a 14% overall performance improvement.
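
Schematically the change amounts to the following sketch (not the actual patch; output_filename and emit_page are illustrative names, and I’m assuming the entries sort by name):

    # Sort each list exactly once, up front...
    sorted_classes = classes.sort_by { |c| c.name }
    sorted_files   = files.sort_by   { |f| f.name }
    sorted_methods = methods.sort_by { |m| m.name }

    # ...then pass the pre-sorted arrays on every call. index_to_links
    # itself no longer sorts, and it reuses scratch arrays in place
    # (map!, reject!, and friends) instead of allocating fresh copies.
    class_links  = index_to_links(output_filename, sorted_classes)
    file_links   = index_to_links(output_filename, sorted_files)
    method_links = index_to_links(output_filename, sorted_methods)
    emit_page(output_filename, class_links, file_links, method_links)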

This patch has been submitted upstream: http://rubyforge.org/tracker/index.php?func=detail&aid=22557&group_id=627&atid=2474

Failed: parser output caching

Seeing that the parser isn’t very fast, I thought about caching its output. So I modified RDoc in the following manner (sketched after the list):

  1. It Marshal.dumps the parser output to a cache file.
  2. It consults the cache file whenever possible, instead of parsing the input file.
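
In outline it looked something like this; the cache location, the MD5 keying and parse_one_file are illustrative, not the exact code I used:

    require 'digest/md5'

    CACHE_DIR = '.rdoc_parse_cache'

    def parse_with_cache(filename)
      source     = File.read(filename)
      cache_file = File.join(CACHE_DIR, Digest::MD5.hexdigest(source))

      if File.exist?(cache_file)
        # Warm run: load the previously dumped parser output.
        File.open(cache_file, 'rb') { |f| Marshal.load(f) }
      else
        # Cold run: parse, then dump the result for the next run.
        result = parse_one_file(filename)   # stand-in for the real parser
        Dir.mkdir(CACHE_DIR) unless File.directory?(CACHE_DIR)
        File.open(cache_file, 'wb') { |f| f.write(Marshal.dump(result)) }
        result
      end
    end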

This made the parsing phase about 20% slower on a cold run, but 75% faster on a warm run.

Unfortunately, this approach totally fails on large projects. When run on the Ruby 1.8 standard library, it results in cache files that are 6 MB or larger. Loading such a file is much slower than parsing the original source file, and uses more memory too. My laptop crashed during this experiment because Ruby used 2 GB of memory. So I abandoned this effort.

Final words

Take the performance improvement numbers that I’ve given for the individual patches with a grain of salt: they were measured by running RDoc on its own sources, under MRI 1.8. The actual gain really depends on the input; you’ve already seen the difference in improvement between running RDoc on the Ruby 1.8 standard library and running it on ActiveRecord.

7 Comments

  1. Joran said,

    October 28, 2008 @ 3:35 pm

    Great work, I don’t like waiting for rdoc.

  2. Hugo Baraúna said,

    October 28, 2008 @ 4:40 pm

    Very good work! =)

  3. Dan Kubb said,

    October 28, 2008 @ 7:29 pm

    Awesome work. How about using Erubis (if available) for the ERB output?

  4. Hongli said,

    October 28, 2008 @ 9:12 pm

    I doubt Erubis will help. It might be faster at compiling the templates, but not faster at evaluating.

  5. Double Shot #322 « A Fresh Cup said,

    October 29, 2008 @ 11:26 am

[...] Optimizing RDoc – Hongli Lai attacks RDoc and wins. A good look at how to optimize Ruby code, too. [...]

  6. roger said,

    November 10, 2008 @ 5:36 pm

    You may also want to submit the patches to http://redmine.ruby-lang.org/ so that you get more exposure to the ruby core team for them.
    Nice work–I too hate[d] how slow rdoc is :)
    -=R

  7. เร็วส์ หกสิบหก » นั่งเทียนเขียนข่าว#25 said,

    November 13, 2008 @ 6:30 pm

    [...] Optimizing RDoc [...]
