When it comes to documentation, there’s always room for improvement. I’ve been contributing to docrails for quite a while now. However, one issue that has always annoyed me is how slow RDoc is.
In particular, generating RDoc output for ActiveRecord is painfully slow. It takes 34.2 seconds on my machine. Using JRuby helps a little, seeing that it can do JIT and all, but not much: it still takes 29 seconds (including JVM startup time). I had hoped for a bigger performance improvement.
So a few days ago I started optimizing RDoc. And not without result. Here are the results of generating RDoc for the Ruby 1.8 standard library with RDoc running in MRI 1.8:
Before optimizing: 1624.8 seconds (27 minutes). I had to close all applications because RDoc ate almost 1.5 GB of memory.
After optimizing: 531.2 seconds (8 minutes and 51 seconds). RDoc now uses only about half the memory it used to need.
Performance improvement: 206%!
The results for generating RDoc for ActiveRecord are:
Before optimizing (MRI 1.8): 34.2 seconds
After optimizing (MRI 1.8): 29.9 seconds
Performance improvement: 14%
Before optimizing (MRI 1.9): 16.8 seconds
After optimizing (MRI 1.9): 13.2 seconds
Performance improvement: 27%
Before optimizing (JRuby): 29.0 seconds
After optimizing (JRuby): 24.5 seconds
Performance improvement: 18%
So what did I do? Read on!
My laptop has a dual core CPU. It’s kind of a waste to see RDoc utilizing only a single core. RDoc can be parallelized by running 2 rdoc processes at the same time. But I’m never generating 2 different sets of RDoc documentation at the same time, so running 2 rdoc processes in parallel won’t do me any good. So I had to search for ways to parallelize a single rdoc process.
It turns out that RDoc’s code parsing phase is pretty easy to parallelize. It just creates a new parser object for every input file, and parses files sequentially. So I modified RDoc to create multiple worker threads during the parsing phase. The main thread will offer a list of filenames, and the worker threads will each consume filenames and parse the corresponding file as fast as they can.
What I did not try to do however, is making the parser itself multi-threaded. That’s just asking for problems. Those of you who are interested in multi-threaded programming, but aren’t experienced with it, should keep this in mind: try to keep your threading code as simple as possible, and make sure that your threads share as little data as possible. In my case, the only things that the threads share are the filename queue and the result aggregation array.
MRI 1.8 implements userspace threads, so having multiple threads doesn’t increase CPU core utilization. MRI 1.9 uses kernel threads, but it has a global interpreter lock, so it can’t utilize multiple CPU cores either. Luckily, JRuby doesn’t have a global interpreter lock. The parsing phase is now about 35% faster, when run on JRuby on my dual core machine.
This patch has been submitted upstream: http://rubyforge.org/tracker/index.php?func=detail&aid=22555&group_id=627&atid=2474
Cache template pages
For larger projects, the slowest phase of RDoc is probably the HTML generation phase. The latest version of RDoc uses ERB templates for HTML output. RDoc generates an HTML file for every class, every module and every source file. However after some profiling with ruby-prof, I found out that RDoc recompiles the ERB template for every output file, which is slow.
I modified RDoc so that it caches compiled ERB templates. This made it about 19% faster.
This patch has been submitted upstream: http://rubyforge.org/tracker/index.php?func=detail&aid=22556&group_id=627&atid=2474
Reduce the number of relative URL generations
For every output file, RDoc generates a list of URLs to this output file, relative to the output filenames of all classes, files and methods. This is done by calling the function
index_to_links, which runs in O(n) time. For every output file, index_to_links is called 3 times, each time with the full list of classes, files and methods. Suppose there are K output files, L classes, N files and M methods. Then this will result in K * (L+N+M) operations, which is quadratic! Ouch!
The RDoc authors already tried to optimize this by only calling index_to_links when the current output filename’s directory is different from the last one. For example, if the last output file was classes/active_support/cache/memcache_store.html and the current output file is classes/active_support/cache/memory_store.html, then index_to_links doesn’t need to be called again, and its last result can be reused. The problem is that the output filename list isn’t sorted, and so a directory change is detected much more often that it needs to be.
I optimized this by sorting the list. This resulted in an 8% performance improvement.
This patch has been submitted upstream: http://rubyforge.org/tracker/index.php?func=detail&aid=22565&group_id=627&atid=2474
Reduce garbage and redundant operations in index_to_links
But the story doesn’t end there. index_to_links accepts two arguments: a filename and an array. index_to_links sorts its input array every time it is called. Remember that I said that index_to_links is O(n)? It’s not: it’s actually O(n log(n)) because of the sorting.
The arrays that are passed to index_to_links remains the same every time. So it’s very inefficient to keep sorting them over and over. I optimized this by:
- sorting the arrays only once.
- passing the sorted arrays to index_to_links.
- modifying index_to_links to not perform any sorting.
- modifying index_to_links to use in-place array manipulation methods as much as possible, to avoid generating unnecessary garbage objects.
The result is a 14% overall performance improvement.
This patch has been submitted upstream: http://rubyforge.org/tracker/index.php?func=detail&aid=22557&group_id=627&atid=2474
Failed: parser output caching
Seeing that the parser isn’t very fast, I thought about caching its output. So I modified RDoc in the following manner:
Marshal.dumps the parser output to a cache file.
- It consults the cache file whenever possible, instead of parsing the input file.
This made the parsing phase about 20% slower on a cold run, but 75% faster on a warm run.
Unfortunately, this approach totally fails on large projects. When run on the Ruby 1.8 standard library, it results in cache files that are 6 MB or larger. Loading such a file is much slower than parsing the original source file, and uses more memory too. My laptop crashed during this experiment because Ruby used 2 GB of memory. So I abandoned this effort.
Take the performance improvement numbers that I’ve given with a grain of salt. These numbers were taken by running RDoc on its own sources, under MRI 1.8. The performance gain really depends on the input. You’ve already seen the difference in performance improvement between running RDoc on the Ruby 1.8 standard library, and running it on ActiveRecord.