Fork me on GitHub

From Siberia with love

How to find and fix bug in JRuby and don't go crazy

Hi there. I wrote this post to show how much effort is invested in small and simple one-line pull request. I tried to show the whole process from the appearance of the problem upon fixing it and sending patch back to upstream.

Other day I tried new ARM64 cloud hosting. When I tried to update RubyGems itself I’ve got strange error message:

kes@labs:~$ ~/jruby-9.1.8.0/bin/jruby --dev -J-noverify -S gem update --system
Updating rubygems-update
Fetching: rubygems-update-2.6.12.gem (100%)
Successfully installed rubygems-update-2.6.12
Installing RubyGems 2.6.12
io/console on JRuby shells out to stty for most operations
ERROR:  While executing gem ... (ArgumentError)
    could not find a temporary directory

This was really strange to see… I checked that /tmp directory exists:

kes@labs:~$ ls -al /
total 88
drwxr-xr-x  21 root root  4096 Apr 27 09:55 .
drwxr-xr-x  21 root root  4096 Apr 27 09:55 ..
<..>
drwxrwxrwt   9 root root 12288 May 10 08:55 tmp
<..>

On the next step I rerun the same command in VERBOSE and DEBUG modes, but this didn’t give any useful information for me. Then I tried to reproduce this error in REPL:

kes@labs:~$ ~/jruby-9.1.8.0/bin/jruby --dev -J-noverify -S irb
irb(main):001:0> require 'tmpdir'; Dir.tmpdir
ArgumentError: could not find a temporary directory
        from /home/kes/jruby-9.1.8.0/lib/ruby/stdlib/tmpdir.rb:35:in `tmpdir'
        from (irb):1:in `<eval>'
        from org/jruby/RubyKernel.java:1000:in `eval'
        from org/jruby/RubyKernel.java:1298:in `loop'
        from org/jruby/RubyKernel.java:1120:in `catch'
        from org/jruby/RubyKernel.java:1120:in `catch'
        from /home/kes/jruby-9.1.8.0/bin/jirb:13:in `<main>'

Aha! This was much better. I could go and look around the line where error arose:

require 'fileutils'
begin
  require 'etc.so'
rescue LoadError # rescue LoadError for miniruby
end

class Dir

  @@systmpdir ||= defined?(Etc.systmpdir) ? Etc.systmpdir : '/tmp'

  ##
  # Returns the operating system's temporary file path.

  def self.tmpdir
    if $SAFE > 0
      @@systmpdir.dup
    else
      tmp = nil
      [ENV['TMPDIR'], ENV['TMP'], ENV['TEMP'], @@systmpdir, '/tmp', '.'].each do |dir|
        next if !dir
        dir = File.expand_path(dir)
        if stat = File.stat(dir) and stat.directory? and stat.writable? and
            (!stat.world_writable? or stat.sticky?)
          tmp = dir
          break
        end rescue nil
      end
      raise ArgumentError, "could not find a temporary directory" unless tmp
      tmp
    end
  end
<..>

Somehow tmp directory couldn’t be found… Lets try it in REPL:

irb(main):002:0> require 'tmpdir'
=> true
irb(main):003:0> Etc.systmpdir
=> "/tmp"
irb(main):004:0> [ENV['TMPDIR'], ENV['TMP'], ENV['TEMP'], '/tmp'].map do |d|
irb(main):005:1*   next if !d
irb(main):006:1>   dir = File.expand_path(d)
irb(main):007:1>   stat = File.stat(dir)
irb(main):008:1>   [stat.directory?, stat.writable?, !stat.world_writable?, stat.sticky?]
irb(main):009:1> end
=> [nil, nil, nil, [false, false, true, false]]

It turns out that JRuby is thinking that /tmp is not a directry and this is looking like a bug. I tried to reproduce this bug on Linux x86_64 and on macOS, but it worked there as expected.

kes@etehtsea:~$ jruby-9.1.8.0/bin/jruby -S irb
require 'tmpdir'; Dir.tmpdirirb(main):001:0> require 'tmpdir'; Dir.tmpdir
=> "/tmp"
irb(main):002:0> File.stat('/tmp').directory?
=> true

Based on this it seems that this bug is platform-dependent. So we’ve got platform-dependent bug on “platform-independent” VM.

yey

To move on we need to look into File::Stat implementation:

@JRubyMethod(name = "directory?")
public RubyBoolean directory_p() {
    checkInitialized();
    return getRuntime().newBoolean(stat.isDirectory());
}

Okay… what is the value of stat variable? Apparently it’s an instance of jnr.posix.FileStat:

public class RubyFileStat extends RubyObject {
    <..>
    private FileStat stat;
    <..>
}

At this point we narrowed down our search area to jnr-posix dependency of JRuby. I recalled that before starting to debug, it would be good idea to try to run tests. Likely that nobody ever run them on AARCH64 platform.

Tests run: 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 2.415 sec <<< FAILURE! - in jnr.posix.FileStatTest
filestatDirectory(jnr.posix.FileStatTest)  Time elapsed: 0.021 sec  <<< FAILURE!
java.lang.AssertionError: null
        at org.junit.Assert.fail(Assert.java:86)
        at org.junit.Assert.assertTrue(Assert.java:41)
        at org.junit.Assert.assertTrue(Assert.java:52)
        at jnr.posix.FileStatTest.filestatDirectory(FileStatTest.java:196)

This is rare situation when failing test is a good sign. Go and look into it:

@Test
public void filestatDirectory() throws Throwable {
    File f = File.createTempFile("stat", null).getParentFile();
    try {
        FileStat stat = posix.stat(f.getAbsolutePath());

        assertTrue(stat.isDirectory());
    } finally {
        f.delete();
    }
}

Wonderful, it seems that we have isolated test that covers our bug. At this point I could open issue in jnr-posix and jruby bug trackers. On the other hand wouldn’t it be better to try to fix this bug in an unfamiliar project written in language which I don’t use on a daily basis in addition with native calls done through FFI. I think everyone who wanna be better developer should from time to time go out of its comfort zone.

challenge-accepted

Lets look at isDirectory() method defined in BaseFileStat:

public boolean isDirectory() {
    return (mode() & S_IFMT) == S_IFDIR;
}

Here I interested in mode() value. The most simple way to get its value is to put println in this place.

public boolean isDirectory() {
    System.out.println("Mode: " + mode());
    return (mode() & S_IFMT) == S_IFDIR;
}

Rerun test:

<..>
Mode: 0
<..>

Obviously that mode() shouldn’t be equal to 0. For our platfrom this function is defined in LinuxFileStat64:

public int mode() {
    return layout.st_mode.get(memory);
}

Per se class LinuxFileStat is stat64 structure reproduced in FFI Struct. At this point I plunged into reading the source code and documentation.

Later on…

Actually struct64 is filling up by __xstat64 native call.

@Override
public int stat(String path, FileStat stat) {
    if (use_xstat64) {
        try {
            return ((LinuxLibC) libc()).__xstat64(statVersion, path, stat);
        } catch (UnsatisfiedLinkError ex) {
            use_xstat64 = false;
            return old_stat(path, stat);
        }
    } else {
        return old_stat(path, stat);
    }
}

For a start I checked that there is no problem in native code. For this I’ve got and slightly modified example code from wikipedia.

#define _LARGEFILE_SOURCE 1

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#include <sys/types.h>
#include <pwd.h>
#include <grp.h>
#include <sys/stat.h>
#include <unistd.h>

void print_bytes(const void *object, size_t size)
{
    const unsigned char * const bytes = object;
    size_t i;

    printf("[ ");
    for(i = 0; i < size; i++)
    {
        printf("%02x ", bytes[i]);
    }
    printf("]\n");
}

int main(int argc, char *argv[])
{
        struct stat sb;

        if (argc < 2)
        {
                fprintf(stderr, "Usage: %s: file ...\n", argv[0]);
                exit(EXIT_FAILURE);
        }

        for (int i = 1; i < argc; i++)
        {
                if (-1 == __xstat64(0, argv[i], &sb))
                {
                        perror("stat()");
                        exit(EXIT_FAILURE);
                }

                printf("%s:\n", argv[i]);
                printf("\tdev: %u\n", sb.st_dev);
                printf("\tinode: %u\n", sb.st_ino);
                printf("\tmode: %o\n", sb.st_mode);
                printf("\tlinks: %d\n", sb.st_nlink);
                printf("\towner: %u\n", sb.st_uid);
                printf("\tgroup: %u\n", sb.st_gid);
                printf("\trdev: %u\n", sb.st_rdev);
                printf("\tsize: %ld\n", sb.st_size);
                printf("\tblksize: %d\n", sb.st_blksize);
                printf("\tblkcnt: %ld\n", sb.st_blocks);
                printf("\tatime: %s", ctime(&sb.st_atim.tv_sec));
                printf("\tmtime: %s", ctime(&sb.st_mtim.tv_sec));
                printf("\tctime: %s", ctime(&sb.st_ctim.tv_sec));
                printf("\n");

                printf("mode_t: %d\n", sizeof(mode_t));
                printf("dev_t: %d\n", sizeof(dev_t));
                printf("ino_t: %d\n", sizeof(ino_t));
                printf("nlink_t: %d\n", sizeof(nlink_t));
                printf("uid_t: %d\n", sizeof(uid_t));
                printf("gid_t: %d\n", sizeof(gid_t));
                printf("off_t: %d\n", sizeof(off_t));
                printf("blksize_t: %d\n", sizeof(blksize_t));
                printf("blkcnt_t: %d\n", sizeof(blkcnt_t));
                printf("struct timespec: %d\n", sizeof(struct timespec));
                printf("struct size: %d\n", sizeof(sb));

                print_bytes(&sb, sizeof(sb));
        }

        return 0;
}

Compiled it and ran:

kes@labs:~$ cc -std=gnu11 xstat.c -o xstat && ./xstat /tmp
xstat.c: In function ‘main’:
xstat.c:39:13: warning: implicit declaration of function ‘__xstat64’ [-Wimplicit-function-declaration]
   if (-1 == __xstat64(0, argv[i], &sb))
             ^~~~~~~~~
/tmp:
        dev: 64768
        inode: 393217
        mode: 41777
        links: 9
        owner: 0
        group: 0
        rdev: 0
        size: 12288
        blksize: 4096
        blkcnt: 24
        atime: Wed May 10 08:58:36 2017
        mtime: Wed May 10 16:14:02 2017
        ctime: Wed May 10 16:14:02 2017

mode_t: 4
dev_t: 8
ino_t: 8
nlink_t: 4
uid_t: 4
gid_t: 4
off_t: 8
blksize_t: 4
blkcnt_t: 8
struct timespec: 16
struct size: 128
[ 00 fd 00 00 00 00 00 00 01 00 06 00 00 00 00 00 ff 43 00 00 09 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 30 00 00 00 00 00 00 00 10 00 00 00 00 00 00 18 00 00 00 00 00 00 00 3c d6 12 59 00 00 00 00 0e d5 8c 10 00 00 00 00 4a 3c 13 59 00 00 00 00 73 e9 5b 16 00 00 00 00 4a 3c 13 59 00 00 00 00 73 e9 5b 16 00 00 00 00 00 00 00 00 00 00 00 00 ]

Thus native code isn’t the root of the issue. To compare I compiled and ran the same example on Linux x86_64:

/tmp:
	dev: 31
	inode: 9093
	mode: 41777
	links: 9
	owner: 0
	group: 0
	rdev: 0
	size: 680
	blksize: 4096
	blkcnt: 0
	atime: Wed May 10 08:36:05 2017
	mtime: Wed May 10 19:16:45 2017
	ctime: Wed May 10 19:16:45 2017

mode_t: 4
dev_t: 8
ino_t: 8
nlink_t: 8
uid_t: 4
gid_t: 4
off_t: 8
blksize_t: 8
blkcnt_t: 8
struct timespec: 16
struct size: 144
[ 1f 00 00 00 00 00 00 00 85 23 00 00 00 00 00 00 09 00 00 00 00 00 00 00 ff 43 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 a8 02 00 00 00 00 00 00 00 10 00 00 00 00 00 00 00 00 00 00 00 00 00 00 c5 a6 12 59 00 00 00 00 1c 6d fa 06 00 00 00 00 ed 3c 13 59 00 00 00 00 7e 04 81 2c 00 00 00 00 ed 3c 13 59 00 00 00 00 7e 04 81 2c 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ]

Noticeable difference is structure size. This can be the cause of the bug because we have only one FFI mapping for all Linux 64 platforms. To confirm this I made similar program in Java:

import java.io.File;
import jnr.ffi.*;
import jnr.ffi.Runtime;
import jnr.ffi.annotations.Out;
import jnr.ffi.annotations.Transient;

public class HelloWorld {
    public static interface LibC {
        int puts(String s);
        int __xstat64(int statVersion, CharSequence path, @Out @Transient SB stat);
    }

    public static final class SB extends Struct {
        public SB(Runtime runtime) {
            super(runtime);
        }

        public final Signed64 st_dev = new Signed64();
        public final Signed64 st_ino = new Signed64();
        public final Signed64 st_nlink = new Signed64();
        public final Signed32 st_mode = new Signed32();
        public final Signed32 st_uid = new Signed32();
        public final Signed32 st_gid = new Signed32();
        public final Signed64 st_rdev = new Signed64();
        public final Signed64 st_size = new Signed64();
        public final Signed64 st_blksize = new Signed64();
        public final Signed64 st_blocks = new Signed64();
        public final Signed64 st_atime = new Signed64();     // Time of last access (time_t)
        public final Signed64 st_atimensec = new Signed64(); // Time of last access (nanoseconds)
        public final Signed64 st_mtime = new Signed64();     // Last data modification time (time_t)
        public final Signed64 st_mtimensec = new Signed64(); // Last data modification time (nanoseconds)
        public final Signed64 st_ctime = new Signed64();     // Time of last status change (time_t)
        public final Signed64 st_ctimensec = new Signed64(); // Time of last status change (nanoseconds)
        public final Signed64 __unused4 = new Signed64();
        public final Signed64 __unused5 = new Signed64();
        public final Signed64 __unused6 = new Signed64();
    }

    public static void main(String[] args) throws Throwable {
        LibC libc = LibraryLoader.create(LibC.class).load("c");
        File f = File.createTempFile("stat", null).getParentFile();
        Runtime runtime = Runtime.getRuntime(libc);
        SB stat = new SB(runtime);
        System.out.println("Struct size: " + Struct.size(stat));

        try {
            System.out.println(f.getAbsolutePath());
            int ret = libc.__xstat64(0, f.getAbsolutePath(), stat);
            System.out.println("Finished with: " + ret);

            System.out.println("\tdev: " + stat.st_dev.get());
            System.out.println("\tinode: " + stat.st_ino.get());
            System.out.println("\tmode: " + stat.st_mode.get());
            System.out.println("\tlinks: " + stat.st_nlink.get());
            System.out.println("\towner: " + stat.st_uid.get());
            System.out.println("\tgroup: " + stat.st_gid.get());
            System.out.println("\trdev: " + stat.st_rdev.get());
            System.out.println("\tsize: " + stat.st_size.get());
            System.out.println("\tblksize: " + stat.st_blksize.get());
            System.out.println("\tblkcnt: " + stat.st_blocks.get());
        } finally {
            f.delete();
        }
        libc.puts("Hello, World");
    }
}

Output on AARCH64:

kes@labs:~$ javac -cp jnr-ffi-2.1.5.jar HelloWorld.java && java -cp asm-4.0_RC2.jar:jffi-1.2.15.jar:jnr-ffi-2.1.5.jar:. HelloWorld
Struct size: 144
/tmp
Finished with: 0
        dev: 64768
        inode: 393217
        mode: 0
        links: 38654723071
        owner: 0
        group: 0
        rdev: 0
        size: 12288
        blksize: 4096
        blkcnt: 24
Hello, World

mode equals to 0 and links field contains garbage value. Maybe stat64 has different structure on AARCH64 platform. What could be easier than to find in docs its definition and repeat it in jnr-posix? In fact I didn’t find any formal specification. There is description in man 2 stat, but it don’t count padding that could be made by compiler. Because I don’t have any practice in C the best idea that I had at this point is to print it as byte array and to analyse it in hex editor in order to understand its layout.

I shared a problem with my friend who is C programmer and he gave me a hint about pahole utility which shows data structure layout. To get it we need to recompile our program with -g flag specified. It includes debug information in compiled binary.

kes@labs:~$ pahole xstat
<..>
struct stat {
        __dev_t                    st_dev;               /*     0     8 */
        __ino_t                    st_ino;               /*     8     8 */
        __mode_t                   st_mode;              /*    16     4 */
        __nlink_t                  st_nlink;             /*    20     4 */
        __uid_t                    st_uid;               /*    24     4 */
        __gid_t                    st_gid;               /*    28     4 */
        __dev_t                    st_rdev;              /*    32     8 */
        __dev_t                    __pad1;               /*    40     8 */
        __off_t                    st_size;              /*    48     8 */
        __blksize_t                st_blksize;           /*    56     4 */
        int                        __pad2;               /*    60     4 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        __blkcnt_t                 st_blocks;            /*    64     8 */
        struct timespec            st_atim;              /*    72    16 */
        struct timespec            st_mtim;              /*    88    16 */
        struct timespec            st_ctim;              /*   104    16 */
        int                        __glibc_reserved[2];  /*   120     8 */
        /* --- cacheline 2 boundary (128 bytes) --- */

        /* size: 128, cachelines: 2, members: 16 */
};

This is awesome! Based on this info I repeated this struct in FFI struct layout for AARCH64 platform. Lets see what changed in our test program output:

kes@labs:~$ javac -cp jnr-ffi-2.1.5.jar HelloWorld.java && java -cp asm-4.0_RC2.jar:jffi-1.2.15.jar:jnr-ffi-2.1.5.jar:. HelloWorld
Struct size: 128
/tmp
Finished with: 0
        dev: 64768
        inode: 393217
        mode: 17407
        links: 9
        owner: 0
        group: 0
        rdev: 0
        size: 12288
        blksize: 4096
        blkcnt: 24
Hello, World

It works! Then it was needed to check that failing test in jnr-posix now passes and error in JRuby doesn’t reproduce anymore. As a result I opened pull requests in jnr-posix and JRuby projects.

all_folks

Update:

Aha, not so fast! One thing kept bothering me: when I ran pahole on x86_64 platform I saw that it also has padded field in its layout.

struct stat {
	__dev_t                    st_dev;               /*     0     8 */
	__ino_t                    st_ino;               /*     8     8 */
	__nlink_t                  st_nlink;             /*    16     8 */
	__mode_t                   st_mode;              /*    24     4 */
	__uid_t                    st_uid;               /*    28     4 */
	__gid_t                    st_gid;               /*    32     4 */
	int                        __pad0;               /*    36     4 */ /* <------ WAT /*
	__dev_t                    st_rdev;              /*    40     8 */
	__off_t                    st_size;              /*    48     8 */
	__blksize_t                st_blksize;           /*    56     8 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	__blkcnt_t                 st_blocks;            /*    64     8 */
	struct timespec            st_atim;              /*    72    16 */
	struct timespec            st_mtim;              /*    88    16 */
	struct timespec            st_ctim;              /*   104    16 */
	__syscall_slong_t          __glibc_reserved[3];  /*   120    24 */
	/* --- cacheline 2 boundary (128 bytes) was 16 bytes ago --- */

	/* size: 144, cachelines: 3, members: 15 */
	/* last cacheline: 16 bytes */
};

But there is a test that checks size and it passes.

wat

As it turned out I’m not the only person who has such confusion. jnr-posix maintainer also enebo asked me about this. cellscape clued me that jnr.ffi.StructLayout is doing alignment automatically.

After that I made final version of layout and pull request was merged.

blog comments powered byDisqus