From Siberia with love

The great ruby shootout

I used the benchmark suite from ruby-1.9.3-p125. All tests were run on:

Implementations:

JRuby was run with the --server -Xinvokedynamic.constants=true flags.

The compiler matters

From time to time, I see blog posts about improving ruby performance by applying some patches, but what if we go further and try to improve ruby performance by compiling it with the fastest available compiler? I decided to check this out.

Here is the list:

#!/bin/bash
compilers=( gcc gcc-4.2 gcc-4.7 clang )

for i in "${compilers[@]}"; do
  CC=$i ./configure --disable-install-doc --prefix ~/Projects/benches/mri/1.9.3-p125-$i
  time make -j4
  make install
done

$ ruby driver.rb -v -o ~/Projects/benches/compilers-bench.txt \
--executables='~/Projects/benches/mri/1.9.3-p125-gcc/bin/ruby;
               ~/Projects/benches/mri/1.9.3-p125-gcc-4.2/bin/ruby;
               ~/Projects/benches/mri/1.9.3-p125-gcc-4.7/bin/ruby;
               ~/Projects/benches/mri/1.9.3-p125-clang/bin/ruby'

Results:

mri-compilers

Oh, the default llvm-gcc is ~20% slower in synthetic tests than the -pre version of gcc-4.7 (I ran the bench a couple of times and got similar results each time).

compile-time

To be sure that nothing broke with gcc-4.7:

PASS all 943 tests
KNOWNBUGS.rb .
PASS all 1 tests

OK, I want to try it

That is easy if you have Homebrew installed:

$ brew install https://raw.github.com/etehtsea/formulary/009735e66ccabc5867331f64a406073d1623c683/Formula/gcc.rb --enable-cxx --enable-profiled-build --use-gcc

~ 1 hour later...

$ CC=gcc-4.7 ruby-build 1.9.3-p125 ~/.rbenv/versions/1.9.3-p125
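
If you want to double-check which compiler a freshly built ruby was actually compiled with, RbConfig keeps the CC value recorded at configure time (just a quick sanity check; run it with the new binary):

require 'rbconfig'

# prints the C compiler this interpreter was configured with, e.g. "gcc-4.7"
puts RbConfig::CONFIG['CC']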

What about any other implementation?

I couldn't stop there, and driven by curiosity, I ran the benchmark on other popular ruby implementations and MRI versions. I won't post the complete logs, only some highlights.

Don't use system ruby

It's a trap!

bm_vm_thread_mutex3.rb

# 1000 threads, one mutex

require 'thread'
m = Mutex.new
r = 0
max = 2000
(1..max).map{
  Thread.new{
    i=0
    while i<max
      i+=1
      m.synchronize{
        r += 1
      }
    end
  }
}.each{|e|
  e.join
}
raise r.to_s if r != max * max

$ time ~/.rbenv/versions/1.8.7-p357/bin/ruby bm_vm_thread_mutex3.rb
real	0m3.093s
user	0m3.078s
sys	0m0.013s
$ /usr/bin/ruby -v
ruby 1.8.7 (2011-12-28 patchlevel 357) [i686-darwin11.3.0]
$ time /usr/bin/ruby bm_vm_thread_mutex3.rb
^Cbm_vm_thread_mutex3.rb:18:in `join': Interrupt
	from bm_vm_thread_mutex3.rb:18
	from bm_vm_thread_mutex3.rb:7:in `each'
	from bm_vm_thread_mutex3.rb:7

real	3m54.930s
user	3m54.122s
sys	0m0.918s

Even if you're sure that you don't use Threads elsewhere, here are the results without this test:

Failed on 1.8:

Rubinius 1.2.4 vs 2.0.0-dev

I've read that there is no GIL in the 2.0.0-dev version and so on, but the upcoming version is slower, and it really is slower.

The biggest slowdown is again in the bm_vm_thread_mutex3.rb test:

Here are the tests with big differences: result-rubinius

Total result without it:

Rubinius 2.0.0-dev was ~15% slower.

Failed:

MacRuby 0.12 (Nightly)

MacRuby is what you need when you want to write a desktop application for OS X or just use its API, but from a performance point of view there is no reason to use it.

First of all, MacRuby's eval (bm_vm2_eval.rb) is pretty slow:

bm_vm2_eval.rb

i=0
while i<6_000_000 # benchmark loop 2
  i+=1
  eval("1")
end

So are ERB parsing and Class instance creation:

bm_app_erb.rb

#
# Create many HTML strings with ERB.
#

require 'erb'

data = DATA.read
max = 15_000
title = "hello world!"
content = "hello world!\n" * 10

max.times{
  ERB.new(data).result(binding)
}

__END__

<html>
  <head> <%= title %> </head>
  <body>
    <h1> <%= title %> </h1>
    <p>
      <%= content %>
    </p>
  </body>
</html>

bm_vm3_clearmethodcache.rb

i=0
while i<200_000
  i+=1

  Class.new{
    def m; end
  }
end

And other tests with big differences: result-macruby

Failed:

Maglev 1.0

It's interesting that Maglev has similar problems:

result-rubinius

JRuby 1.6 vs 1.7.0-dev

JRuby 1.7.0-dev has similar performance to the 1.6.6 version, with a significant improvement in the bm_vm_thread_mutex3.rb bench:

Total result without it:

Failed:

MRI 2.0.0-dev vs 1.9.3-p125

It's the same situation with the MRI dev branch. Just one improvement, in bm_vm_thread_create_join.rb:

Total shootout

result-total-list1 result-total-list2

Total chart:

result-total-chart1

Chart without them:

result-total-chart2

Looks competitive without them, doesn't it?

Update: Negative timings in vm1/vm2 tests? WTF?

This happens because of the benchmark's accuracy. Each test in these sections runs in a while loop, so the resulting time is calculated as res_time = vm1/2_test_result - loop_whileloop1/2_result
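
For example, with made-up numbers (these timings are hypothetical, not taken from my runs): the bare while loop and the vm1_*/vm2_* test are timed in separate runs, so ordinary run-to-run jitter alone can push the difference below zero.

# hypothetical timings, just to show how the subtraction can go negative
loop_whileloop1_result = 1.003 # best time of the empty 30_000_000-iteration loop
vm1_const_result       = 0.998 # best time of bm_vm1_const.rb on the same VM

res_time = vm1_const_result - loop_whileloop1_result
puts sprintf("%.3f", res_time) # => "-0.005", shown as a negative timing

The faster a VM runs the bare loop, the smaller the real difference is, and the easier it is for noise to flip the sign.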

bm_loop_whileloop.rb

i=0
while i<30_000_000 # benchmark loop 1
  i+=1
end

bm_vm1_const.rb

Const = 1

i = 0
while i<30_000_000 # while loop 1
  i+= 1
  j = Const
  k = Const
end

bm_vm1_ensure.rb

i=0
while i<30_000_000 # benchmark loop 1
  i+=1
  begin
    begin
    ensure
    end
  ensure
  end
end

Results for JRuby 1.7.0.dev:

Here is the relevant (truncated) code from driver.rb as proof:

# remember the minimal times of the bare while loops as baselines
if /bm_loop_whileloop.rb/ =~ file
  @loop_wl1 = r[1].map{|e| e.min}
elsif /bm_loop_whileloop2.rb/ =~ file
  @loop_wl2 = r[1].map{|e| e.min}
end

output "name\t#{@execs.map{|(e, v)| v}.join("\t")}#{difference}"
@results.each{|v, result|
  rets = []
  s = nil
  result.each_with_index{|e, i|
    r = e.min
    case v
    when /^vm1_/
      # subtract the while-loop baseline from vm1_* results
      if @loop_wl1
        r -= @loop_wl1[i]
        s = '*'
      end
    when /^vm2_/
      # ...and from vm2_* results
      if @loop_wl2
        r -= @loop_wl2[i]
        s = '*'
      end
    end
    rets << sprintf("%.3f", r)
  }

P.S. Please correct me if I messed up somewhere (especially in English grammar).