How to find and fix bug in JRuby and don't go crazy
Hi there. I wrote this post to show how much effort is invested in small and simple one-line pull request. I tried to show the whole process from the appearance of the problem upon fixing it and sending patch back to upstream.
Other day I tried new ARM64 cloud hosting. When I tried to update RubyGems itself
I’ve got strange error message:
kes@labs:~$ ~/jruby-9.1.8.0/bin/jruby --dev -J-noverify -S gem update --system
Updating rubygems-update
Fetching: rubygems-update-2.6.12.gem (100%)
Successfully installed rubygems-update-2.6.12
Installing RubyGems 2.6.12
io/console on JRuby shells out to stty for most operations
ERROR: While executing gem ... (ArgumentError)
could not find a temporary directory
This was really strange to see… I checked that /tmp directory exists:
kes@labs:~$ ls -al /
total 88
drwxr-xr-x 21 root root 4096 Apr 27 09:55 .
drwxr-xr-x 21 root root 4096 Apr 27 09:55 ..
<..>
drwxrwxrwt 9 root root 12288 May 10 08:55 tmp
<..>
On the next step I rerun the same command in VERBOSE and DEBUG modes, but this didn’t
give any useful information for me. Then I tried to reproduce this error in REPL:
kes@labs:~$ ~/jruby-9.1.8.0/bin/jruby --dev -J-noverify -S irb
irb(main):001:0> require 'tmpdir'; Dir.tmpdir
ArgumentError: could not find a temporary directory
from /home/kes/jruby-9.1.8.0/lib/ruby/stdlib/tmpdir.rb:35:in `tmpdir'
from (irb):1:in `<eval>'
from org/jruby/RubyKernel.java:1000:in `eval'
from org/jruby/RubyKernel.java:1298:in `loop'
from org/jruby/RubyKernel.java:1120:in `catch'
from org/jruby/RubyKernel.java:1120:in `catch'
from /home/kes/jruby-9.1.8.0/bin/jirb:13:in `<main>'
Aha! This was much better. I could go and look around the line where error arose:
require 'fileutils'
begin
require 'etc.so'
rescue LoadError # rescue LoadError for miniruby
end
class Dir
@@systmpdir ||= defined?(Etc.systmpdir) ? Etc.systmpdir : '/tmp'
##
# Returns the operating system's temporary file path.
def self.tmpdir
if $SAFE > 0
@@systmpdir.dup
else
tmp = nil
[ENV['TMPDIR'], ENV['TMP'], ENV['TEMP'], @@systmpdir, '/tmp', '.'].each do |dir|
next if !dir
dir = File.expand_path(dir)
if stat = File.stat(dir) and stat.directory? and stat.writable? and
(!stat.world_writable? or stat.sticky?)
tmp = dir
break
end rescue nil
end
raise ArgumentError, "could not find a temporary directory" unless tmp
tmp
end
end
<..>
Somehow tmp directory couldn’t be found… Lets try it in REPL:
irb(main):002:0> require 'tmpdir'
=> true
irb(main):003:0> Etc.systmpdir
=> "/tmp"
irb(main):004:0> [ENV['TMPDIR'], ENV['TMP'], ENV['TEMP'], '/tmp'].map do |d|
irb(main):005:1* next if !d
irb(main):006:1> dir = File.expand_path(d)
irb(main):007:1> stat = File.stat(dir)
irb(main):008:1> [stat.directory?, stat.writable?, !stat.world_writable?, stat.sticky?]
irb(main):009:1> end
=> [nil, nil, nil, [false, false, true, false]]
It turns out that JRuby is thinking that /tmp is not a directry and this is looking like a bug.
I tried to reproduce this bug on Linux x86_64 and on macOS, but it worked there as expected.
kes@etehtsea:~$ jruby-9.1.8.0/bin/jruby -S irb
require 'tmpdir'; Dir.tmpdirirb(main):001:0> require 'tmpdir'; Dir.tmpdir
=> "/tmp"
irb(main):002:0> File.stat('/tmp').directory?
=> true
Based on this it seems that this bug is platform-dependent. So we’ve got platform-dependent bug on “platform-independent” VM.

To move on we need to look into File::Stat implementation:
@JRubyMethod(name = "directory?")
public RubyBoolean directory_p() {
checkInitialized();
return getRuntime().newBoolean(stat.isDirectory());
}
Okay… what is the value of stat variable? Apparently it’s an instance of jnr.posix.FileStat:
public class RubyFileStat extends RubyObject {
<..>
private FileStat stat;
<..>
}
At this point we narrowed down our search area to jnr-posix dependency of JRuby.
I recalled that before starting to debug, it would be good idea to try to run tests.
Likely that nobody ever run them on AARCH64 platform.
Tests run: 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 2.415 sec <<< FAILURE! - in jnr.posix.FileStatTest
filestatDirectory(jnr.posix.FileStatTest) Time elapsed: 0.021 sec <<< FAILURE!
java.lang.AssertionError: null
at org.junit.Assert.fail(Assert.java:86)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertTrue(Assert.java:52)
at jnr.posix.FileStatTest.filestatDirectory(FileStatTest.java:196)
This is rare situation when failing test is a good sign. Go and look into it:
@Test
public void filestatDirectory() throws Throwable {
File f = File.createTempFile("stat", null).getParentFile();
try {
FileStat stat = posix.stat(f.getAbsolutePath());
assertTrue(stat.isDirectory());
} finally {
f.delete();
}
}
Wonderful, it seems that we have isolated test that covers our bug. At this point
I could open issue in jnr-posix and jruby bug trackers. On the other hand
wouldn’t it be better to try to fix this bug in an unfamiliar project written in language which
I don’t use on a daily basis in addition with native calls done through FFI. I think
everyone who wanna be better developer should from time to time go out of its comfort zone.

Lets look at isDirectory() method defined in BaseFileStat:
public boolean isDirectory() {
return (mode() & S_IFMT) == S_IFDIR;
}
Here I interested in mode() value. The most simple way to get its value is to put
println in this place.
public boolean isDirectory() {
System.out.println("Mode: " + mode());
return (mode() & S_IFMT) == S_IFDIR;
}
Rerun test:
<..>
Mode: 0
<..>
Obviously that mode() shouldn’t be equal to 0. For our platfrom this function
is defined in LinuxFileStat64:
public int mode() {
return layout.st_mode.get(memory);
}
Per se class LinuxFileStat is stat64 structure reproduced in FFI Struct.
At this point I plunged into reading the source code and documentation.
Later on…
Actually struct64 is filling up by __xstat64 native call.
@Override
public int stat(String path, FileStat stat) {
if (use_xstat64) {
try {
return ((LinuxLibC) libc()).__xstat64(statVersion, path, stat);
} catch (UnsatisfiedLinkError ex) {
use_xstat64 = false;
return old_stat(path, stat);
}
} else {
return old_stat(path, stat);
}
}
For a start I checked that there is no problem in native code. For this I’ve got and slightly modified example code from wikipedia.
#define _LARGEFILE_SOURCE 1
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/types.h>
#include <pwd.h>
#include <grp.h>
#include <sys/stat.h>
#include <unistd.h>
void print_bytes(const void *object, size_t size)
{
const unsigned char * const bytes = object;
size_t i;
printf("[ ");
for(i = 0; i < size; i++)
{
printf("%02x ", bytes[i]);
}
printf("]\n");
}
int main(int argc, char *argv[])
{
struct stat sb;
if (argc < 2)
{
fprintf(stderr, "Usage: %s: file ...\n", argv[0]);
exit(EXIT_FAILURE);
}
for (int i = 1; i < argc; i++)
{
if (-1 == __xstat64(0, argv[i], &sb))
{
perror("stat()");
exit(EXIT_FAILURE);
}
printf("%s:\n", argv[i]);
printf("\tdev: %u\n", sb.st_dev);
printf("\tinode: %u\n", sb.st_ino);
printf("\tmode: %o\n", sb.st_mode);
printf("\tlinks: %d\n", sb.st_nlink);
printf("\towner: %u\n", sb.st_uid);
printf("\tgroup: %u\n", sb.st_gid);
printf("\trdev: %u\n", sb.st_rdev);
printf("\tsize: %ld\n", sb.st_size);
printf("\tblksize: %d\n", sb.st_blksize);
printf("\tblkcnt: %ld\n", sb.st_blocks);
printf("\tatime: %s", ctime(&sb.st_atim.tv_sec));
printf("\tmtime: %s", ctime(&sb.st_mtim.tv_sec));
printf("\tctime: %s", ctime(&sb.st_ctim.tv_sec));
printf("\n");
printf("mode_t: %d\n", sizeof(mode_t));
printf("dev_t: %d\n", sizeof(dev_t));
printf("ino_t: %d\n", sizeof(ino_t));
printf("nlink_t: %d\n", sizeof(nlink_t));
printf("uid_t: %d\n", sizeof(uid_t));
printf("gid_t: %d\n", sizeof(gid_t));
printf("off_t: %d\n", sizeof(off_t));
printf("blksize_t: %d\n", sizeof(blksize_t));
printf("blkcnt_t: %d\n", sizeof(blkcnt_t));
printf("struct timespec: %d\n", sizeof(struct timespec));
printf("struct size: %d\n", sizeof(sb));
print_bytes(&sb, sizeof(sb));
}
return 0;
}
Compiled it and ran:
kes@labs:~$ cc -std=gnu11 xstat.c -o xstat && ./xstat /tmp
xstat.c: In function ‘main’:
xstat.c:39:13: warning: implicit declaration of function ‘__xstat64’ [-Wimplicit-function-declaration]
if (-1 == __xstat64(0, argv[i], &sb))
^~~~~~~~~
/tmp:
dev: 64768
inode: 393217
mode: 41777
links: 9
owner: 0
group: 0
rdev: 0
size: 12288
blksize: 4096
blkcnt: 24
atime: Wed May 10 08:58:36 2017
mtime: Wed May 10 16:14:02 2017
ctime: Wed May 10 16:14:02 2017
mode_t: 4
dev_t: 8
ino_t: 8
nlink_t: 4
uid_t: 4
gid_t: 4
off_t: 8
blksize_t: 4
blkcnt_t: 8
struct timespec: 16
struct size: 128
[ 00 fd 00 00 00 00 00 00 01 00 06 00 00 00 00 00 ff 43 00 00 09 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 30 00 00 00 00 00 00 00 10 00 00 00 00 00 00 18 00 00 00 00 00 00 00 3c d6 12 59 00 00 00 00 0e d5 8c 10 00 00 00 00 4a 3c 13 59 00 00 00 00 73 e9 5b 16 00 00 00 00 4a 3c 13 59 00 00 00 00 73 e9 5b 16 00 00 00 00 00 00 00 00 00 00 00 00 ]
Thus native code isn’t the root of the issue. To compare I compiled and ran the same example
on Linux x86_64:
/tmp:
dev: 31
inode: 9093
mode: 41777
links: 9
owner: 0
group: 0
rdev: 0
size: 680
blksize: 4096
blkcnt: 0
atime: Wed May 10 08:36:05 2017
mtime: Wed May 10 19:16:45 2017
ctime: Wed May 10 19:16:45 2017
mode_t: 4
dev_t: 8
ino_t: 8
nlink_t: 8
uid_t: 4
gid_t: 4
off_t: 8
blksize_t: 8
blkcnt_t: 8
struct timespec: 16
struct size: 144
[ 1f 00 00 00 00 00 00 00 85 23 00 00 00 00 00 00 09 00 00 00 00 00 00 00 ff 43 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 a8 02 00 00 00 00 00 00 00 10 00 00 00 00 00 00 00 00 00 00 00 00 00 00 c5 a6 12 59 00 00 00 00 1c 6d fa 06 00 00 00 00 ed 3c 13 59 00 00 00 00 7e 04 81 2c 00 00 00 00 ed 3c 13 59 00 00 00 00 7e 04 81 2c 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ]
Noticeable difference is structure size. This can be the cause of the bug because we have only one
FFI mapping for all Linux 64 platforms. To confirm this I made similar program in Java:
import java.io.File;
import jnr.ffi.*;
import jnr.ffi.Runtime;
import jnr.ffi.annotations.Out;
import jnr.ffi.annotations.Transient;
public class HelloWorld {
public static interface LibC {
int puts(String s);
int __xstat64(int statVersion, CharSequence path, @Out @Transient SB stat);
}
public static final class SB extends Struct {
public SB(Runtime runtime) {
super(runtime);
}
public final Signed64 st_dev = new Signed64();
public final Signed64 st_ino = new Signed64();
public final Signed64 st_nlink = new Signed64();
public final Signed32 st_mode = new Signed32();
public final Signed32 st_uid = new Signed32();
public final Signed32 st_gid = new Signed32();
public final Signed64 st_rdev = new Signed64();
public final Signed64 st_size = new Signed64();
public final Signed64 st_blksize = new Signed64();
public final Signed64 st_blocks = new Signed64();
public final Signed64 st_atime = new Signed64(); // Time of last access (time_t)
public final Signed64 st_atimensec = new Signed64(); // Time of last access (nanoseconds)
public final Signed64 st_mtime = new Signed64(); // Last data modification time (time_t)
public final Signed64 st_mtimensec = new Signed64(); // Last data modification time (nanoseconds)
public final Signed64 st_ctime = new Signed64(); // Time of last status change (time_t)
public final Signed64 st_ctimensec = new Signed64(); // Time of last status change (nanoseconds)
public final Signed64 __unused4 = new Signed64();
public final Signed64 __unused5 = new Signed64();
public final Signed64 __unused6 = new Signed64();
}
public static void main(String[] args) throws Throwable {
LibC libc = LibraryLoader.create(LibC.class).load("c");
File f = File.createTempFile("stat", null).getParentFile();
Runtime runtime = Runtime.getRuntime(libc);
SB stat = new SB(runtime);
System.out.println("Struct size: " + Struct.size(stat));
try {
System.out.println(f.getAbsolutePath());
int ret = libc.__xstat64(0, f.getAbsolutePath(), stat);
System.out.println("Finished with: " + ret);
System.out.println("\tdev: " + stat.st_dev.get());
System.out.println("\tinode: " + stat.st_ino.get());
System.out.println("\tmode: " + stat.st_mode.get());
System.out.println("\tlinks: " + stat.st_nlink.get());
System.out.println("\towner: " + stat.st_uid.get());
System.out.println("\tgroup: " + stat.st_gid.get());
System.out.println("\trdev: " + stat.st_rdev.get());
System.out.println("\tsize: " + stat.st_size.get());
System.out.println("\tblksize: " + stat.st_blksize.get());
System.out.println("\tblkcnt: " + stat.st_blocks.get());
} finally {
f.delete();
}
libc.puts("Hello, World");
}
}
Output on AARCH64:
kes@labs:~$ javac -cp jnr-ffi-2.1.5.jar HelloWorld.java && java -cp asm-4.0_RC2.jar:jffi-1.2.15.jar:jnr-ffi-2.1.5.jar:. HelloWorld
Struct size: 144
/tmp
Finished with: 0
dev: 64768
inode: 393217
mode: 0
links: 38654723071
owner: 0
group: 0
rdev: 0
size: 12288
blksize: 4096
blkcnt: 24
Hello, World
mode equals to 0 and links field contains garbage value. Maybe stat64 has
different structure on AARCH64 platform. What could be easier than to find in docs its
definition and repeat it in jnr-posix? In fact I didn’t find any formal specification.
There is description in man 2 stat, but it don’t count padding that could be made
by compiler. Because I don’t have any practice in C the best idea that I had at this point is to
print it as byte array and to analyse it in hex editor in order to understand its layout.
I shared a problem with my friend who is C programmer and he gave me a hint about
pahole utility which shows data structure layout. To get it we need to recompile
our program with -g flag specified. It includes debug information in compiled binary.
kes@labs:~$ pahole xstat
<..>
struct stat {
__dev_t st_dev; /* 0 8 */
__ino_t st_ino; /* 8 8 */
__mode_t st_mode; /* 16 4 */
__nlink_t st_nlink; /* 20 4 */
__uid_t st_uid; /* 24 4 */
__gid_t st_gid; /* 28 4 */
__dev_t st_rdev; /* 32 8 */
__dev_t __pad1; /* 40 8 */
__off_t st_size; /* 48 8 */
__blksize_t st_blksize; /* 56 4 */
int __pad2; /* 60 4 */
/* --- cacheline 1 boundary (64 bytes) --- */
__blkcnt_t st_blocks; /* 64 8 */
struct timespec st_atim; /* 72 16 */
struct timespec st_mtim; /* 88 16 */
struct timespec st_ctim; /* 104 16 */
int __glibc_reserved[2]; /* 120 8 */
/* --- cacheline 2 boundary (128 bytes) --- */
/* size: 128, cachelines: 2, members: 16 */
};
This is awesome! Based on this info I repeated this struct in FFI struct layout
for AARCH64 platform. Lets see what changed in our test program output:
kes@labs:~$ javac -cp jnr-ffi-2.1.5.jar HelloWorld.java && java -cp asm-4.0_RC2.jar:jffi-1.2.15.jar:jnr-ffi-2.1.5.jar:. HelloWorld
Struct size: 128
/tmp
Finished with: 0
dev: 64768
inode: 393217
mode: 17407
links: 9
owner: 0
group: 0
rdev: 0
size: 12288
blksize: 4096
blkcnt: 24
Hello, World
It works!
Then it was needed to check that failing test in jnr-posix now passes and error in JRuby
doesn’t reproduce anymore. As a result I opened pull requests in jnr-posix and JRuby projects.

Update:
Aha, not so fast! One thing kept bothering me: when I ran pahole on x86_64 platform
I saw that it also has padded field in its layout.
struct stat {
__dev_t st_dev; /* 0 8 */
__ino_t st_ino; /* 8 8 */
__nlink_t st_nlink; /* 16 8 */
__mode_t st_mode; /* 24 4 */
__uid_t st_uid; /* 28 4 */
__gid_t st_gid; /* 32 4 */
int __pad0; /* 36 4 */ /* <------ WAT /*
__dev_t st_rdev; /* 40 8 */
__off_t st_size; /* 48 8 */
__blksize_t st_blksize; /* 56 8 */
/* --- cacheline 1 boundary (64 bytes) --- */
__blkcnt_t st_blocks; /* 64 8 */
struct timespec st_atim; /* 72 16 */
struct timespec st_mtim; /* 88 16 */
struct timespec st_ctim; /* 104 16 */
__syscall_slong_t __glibc_reserved[3]; /* 120 24 */
/* --- cacheline 2 boundary (128 bytes) was 16 bytes ago --- */
/* size: 144, cachelines: 3, members: 15 */
/* last cacheline: 16 bytes */
};
But there is a test that checks size and it passes.

As it turned out I’m not the only person who has such confusion. jnr-posix maintainer also
enebo asked me about this. cellscape clued me that
jnr.ffi.StructLayout is doing alignment automatically.
After that I made final version of layout and pull request was merged.
blog comments powered byDisqus